Perplexity (Quality of Generation) Scores #406

Green-Sky · 2023-03-22T18:54:23Z

Green-Sky
Mar 22, 2023
Collaborator

We are currently collecting Perplexity scores for all models + quantization + program flags. Use this discussion to Coordinate.

Mostly Default `./perplexity` settings with all of `wiki.test.raw`

Results in italics are now being added / updated with BLAS enabled and using quantization as per PR #896. These results are collected from various sources and builds, so will contain inconsistencies and errors.

chunks	model	f16	q4_0	q4_1
655	7B	5.9565	6.2644	6.0863
655	13B	5.2455	5.4067	5.3557
655	30B	4.1549	4.2929	4.2701
655	65B	*3.5392*	3.6902	3.6188

Note: Since the tokenizer used by FB Llama and Open Llama are different the following is not a valid inter-model comparison ~ @gjmulder:

chunks	model	f16	q8_0	q4_0	q4_1	q5_0
655	FaceBook llama 7B	5.9565		6.2644	6.0863
655	Open Llama 7B (700M tokens)	7.3889	7.3904	7.5423	7.5078	7.4226

Context sizes: `(512 | 1024 | 2048) ⨯ (7B | 13B | 30B | 65B) ⨯ (llama | alpaca[-lora] | vicuna-GPTQ)` models, first 406 lines of `wiki.test.raw`:

Google GSheet with comments enabled.

I appreciate that alpaca models aren't generative in intent, and so perplexity is not a good measure. However, I was curious to see the trade-off in perplexity for the chat-like models - @gjmulder

History

Prior results in readme: https://github.com/ggerganov/llama.cpp#perplexity-measuring-model-quality
more or less a continuation of Compute perplexity over prompt #270 Add details on perplexity to README.md #395 Quantitative measurement of model perplexity for different models and model quantization modes #129

Feel free to make a new thread in this discussion when you take a measurement, or want to "donate" some compute time.

(@gjmulder @glinscott, et al feel free to make edits to this post)

Green-Sky · 2023-03-22T19:06:10Z

Green-Sky
Mar 22, 2023
Collaborator Author

perplexity run post c-api refactor merge to verify
7B q4_0

result: 6.5990

$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -t 10 -f wikitext-2-raw/wiki.test.raw

main: seed = 1679485635
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 10 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks
39.69 seconds per pass - ETA 7.22 hours
[1]4.7730,[2]5.3123,[3]6.1950,[4]6.8800,[5]6.9340,[6]6.9041,[7]7.0887,[8]7.1750,[9]7.5810,[10]7.8213,[11]8.0918,[12]8.1240,[13]8.0188,[14]8.1072,[15]8.3734,[16]7.9572,[17]7.8212,[18]7.7749,[19]7.3728,[20]7.3484,[21]7.2392,[22]7.0475,[23]7.0023,[24]6.9124,[25]6.9157,[26]6.7310,[27]6.5488,[28]6.4421,[29]6.3525,[30]6.1815,[31]6.1584,[32]6.1738,[33]6.1086,[34]6.1508,[35]6.1759,[36]6.2196,[37]6.2240,[38]6.2366,[39]6.2772,[40]6.3356,[41]6.3409,[42]6.3819,[43]6.3353,[44]6.3968,[45]6.3978,[46]6.3696,[47]6.3961,[48]6.3616,[49]6.3713,[50]6.3279,[51]6.3202,[52]6.3068,[53]6.3542,[54]6.3337,[55]6.3089,[56]6.3478,[57]6.3711,[58]6.3966,[59]6.4139,[60]6.4583,[61]6.4456,[62]6.5124,[63]6.5496,[64]6.5657,[65]6.6125,[66]6.6242,[67]6.6417,[68]6.6580,[69]6.6845,[70]6.7206,[71]6.7451,[72]6.7771,[73]6.8455,[74]6.8530,[75]6.8709,[76]6.8899,[77]6.9010,[78]6.8860,[79]6.9156,[80]6.9079,[81]6.9276,[82]6.9342,[83]6.8771,[84]6.8609,[85]6.8505,[86]6.8257,[87]6.7631,[88]6.7328,[89]6.7128,[90]6.6954,[91]6.7234,[92]6.7190,[93]6.7192,[94]6.7180,[95]6.7474,[96]6.7451,[97]6.7365,[98]6.7287,[99]6.7128,[100]6.7129,[101]6.7378,[102]6.7325,[103]6.7550,[104]6.7601,[105]6.7609,[106]6.7791,[107]6.7803,[108]6.7929,[109]6.7852,[110]6.7785,[111]6.8001,[112]6.8214,[113]6.8229,[114]6.8201,[115]6.8282,[116]6.8219,[117]6.8257,[118]6.8552,[119]6.8773,[120]6.9141,[121]6.9306,[122]6.9567,[123]6.9993,[124]7.0159,[125]7.0075,[126]7.0482,[127]7.0856,[128]7.1132,[129]7.0947,[130]7.1030,[131]7.0973,[132]7.0875,[133]7.0743,[134]7.0831,[135]7.0802,[136]7.0663,[137]7.0581,[138]7.0421,[139]7.0313,[140]7.0281,[141]6.9997,[142]6.9942,[143]6.9683,[144]6.9495,[145]6.9391,[146]6.9246,[147]6.9324,[148]6.9340,[149]6.9280,[150]6.9250,[151]6.9269,[152]6.9180,[153]6.9016,[154]6.8929,[155]6.8992,[156]6.8932,[157]6.9118,[158]6.9151,[159]6.9203,[160]6.9231,[161]6.9358,[162]6.9030,[163]6.8905,[164]6.8622,[165]6.8277,[166]6.7969,[167]6.7559,[168]6.7220,[169]6.7063,[170]6.6933,[171]6.6633,[172]6.6442,[173]6.6245,[174]6.5936,[175]6.5687,[176]6.5562,[177]6.5346,[178]6.5103,[179]6.4922,[180]6.4818,[181]6.4575,[182]6.4374,[183]6.4219,[184]6.4212,[185]6.4129,[186]6.4145,[187]6.4192,[188]6.4146,[189]6.4342,[190]6.4366,[191]6.4576,[192]6.4741,[193]6.4938,[194]6.5070,[195]6.5298,[196]6.5474,[197]6.5713,[198]6.5878,[199]6.5910,[200]6.5960,[201]6.5926,[202]6.6135,[203]6.6211,[204]6.6234,[205]6.6354,[206]6.6433,[207]6.6387,[208]6.6475,[209]6.6517,[210]6.6577,[211]6.6678,[212]6.6756,[213]6.6863,[214]6.6918,[215]6.6948,[216]6.7098,[217]6.7275,[218]6.7411,[219]6.7416,[220]6.7381,[221]6.7334,[222]6.7296,[223]6.7178,[224]6.7116,[225]6.7066,[226]6.7277,[227]6.7371,[228]6.7431,[229]6.7511,[230]6.7463,[231]6.7632,[232]6.7500,[233]6.7311,[234]6.7142,[235]6.7002,[236]6.6919,[237]6.6804,[238]6.6835,[239]6.6663,[240]6.6553,[241]6.6590,[242]6.6631,[243]6.6612,[244]6.6492,[245]6.6459,[246]6.6331,[247]6.6195,[248]6.6118,[249]6.6100,[250]6.6149,[251]6.6067,[252]6.6033,[253]6.5925,[254]6.5885,[255]6.5755,[256]6.5549,[257]6.5423,[258]6.5344,[259]6.5319,[260]6.5233,[261]6.5203,[262]6.5147,[263]6.5085,[264]6.4916,[265]6.4907,[266]6.4900,[267]6.4837,[268]6.4936,[269]6.4907,[270]6.4911,[271]6.4993,[272]6.5050,[273]6.5044,[274]6.5057,[275]6.5155,[276]6.5212,[277]6.5385,[278]6.5506,[279]6.5588,[280]6.5630,[281]6.5738,[282]6.5796,[283]6.5952,[284]6.6023,[285]6.6118,[286]6.6258,[287]6.6247,[288]6.6312,[289]6.6220,[290]6.6066,[291]6.5905,[292]6.5739,[293]6.5600,[294]6.5624,[295]6.5608,[296]6.5662,[297]6.5648,[298]6.5679,[299]6.5661,[300]6.5546,[301]6.5543,[302]6.5460,[303]6.5368,[304]6.5266,[305]6.5245,[306]6.5109,[307]6.5128,[308]6.5164,[309]6.4991,[310]6.4925,[311]6.4862,[312]6.4879,[313]6.4827,[314]6.4817,[315]6.4640,[316]6.4601,[317]6.4424,[318]6.4199,[319]6.4327,[320]6.4471,[321]6.4514,[322]6.4462,[323]6.4398,[324]6.4374,[325]6.4485,[326]6.4485,[327]6.4515,[328]6.4555,[329]6.4621,[330]6.4639,[331]6.4768,[332]6.4733,[333]6.4809,[334]6.4751,[335]6.4686,[336]6.4723,[337]6.4685,[338]6.4665,[339]6.4607,[340]6.4561,[341]6.4646,[342]6.4674,[343]6.4723,[344]6.4724,[345]6.4723,[346]6.4687,[347]6.4731,[348]6.4769,[349]6.4786,[350]6.4749,[351]6.4757,[352]6.4754,[353]6.4702,[354]6.4712,[355]6.4763,[356]6.4799,[357]6.4764,[358]6.4855,[359]6.4888,[360]6.4849,[361]6.4849,[362]6.4919,[363]6.5036,[364]6.5110,[365]6.5180,[366]6.5193,[367]6.5280,[368]6.5255,[369]6.5266,[370]6.5283,[371]6.5218,[372]6.5270,[373]6.5323,[374]6.5305,[375]6.5292,[376]6.5380,[377]6.5318,[378]6.5342,[379]6.5408,[380]6.5315,[381]6.5268,[382]6.5207,[383]6.5194,[384]6.5183,[385]6.5177,[386]6.5170,[387]6.5161,[388]6.5112,[389]6.5050,[390]6.4977,[391]6.4895,[392]6.4850,[393]6.4836,[394]6.4862,[395]6.4842,[396]6.4765,[397]6.4835,[398]6.4874,[399]6.4961,[400]6.4956,[401]6.4968,[402]6.4976,[403]6.4993,[404]6.5063,[405]6.4978,[406]6.4941,[407]6.4938,[408]6.4948,[409]6.5076,[410]6.5197,[411]6.5318,[412]6.5486,[413]6.5601,[414]6.5673,[415]6.5727,[416]6.5817,[417]6.5948,[418]6.5986,[419]6.6061,[420]6.6153,[421]6.6276,[422]6.6330,[423]6.6405,[424]6.6534,[425]6.6635,[426]6.6709,[427]6.6760,[428]6.6849,[429]6.6894,[430]6.6989,[431]6.7144,[432]6.7182,[433]6.7167,[434]6.7115,[435]6.7122,[436]6.7145,[437]6.7247,[438]6.7328,[439]6.7294,[440]6.7283,[441]6.7228,[442]6.7216,[443]6.7230,[444]6.7236,[445]6.7213,[446]6.7240,[447]6.7271,[448]6.7315,[449]6.7288,[450]6.7292,[451]6.7247,[452]6.7148,[453]6.7058,[454]6.6996,[455]6.7003,[456]6.7054,[457]6.7073,[458]6.7049,[459]6.7061,[460]6.7150,[461]6.7117,[462]6.7099,[463]6.7150,[464]6.7142,[465]6.7112,[466]6.7030,[467]6.7037,[468]6.7041,[469]6.7067,[470]6.7073,[471]6.7022,[472]6.7076,[473]6.7013,[474]6.7036,[475]6.6978,[476]6.7000,[477]6.6932,[478]6.6928,[479]6.6998,[480]6.7051,[481]6.7071,[482]6.7021,[483]6.6975,[484]6.7006,[485]6.6993,[486]6.6937,[487]6.6935,[488]6.6914,[489]6.6861,[490]6.6831,[491]6.6795,[492]6.6736,[493]6.6710,[494]6.6691,[495]6.6689,[496]6.6656,[497]6.6607,[498]6.6585,[499]6.6531,[500]6.6429,[501]6.6359,[502]6.6357,[503]6.6350,[504]6.6254,[505]6.6281,[506]6.6291,[507]6.6240,[508]6.6192,[509]6.6181,[510]6.6222,[511]6.6272,[512]6.6309,[513]6.6327,[514]6.6398,[515]6.6343,[516]6.6338,[517]6.6343,[518]6.6347,[519]6.6378,[520]6.6408,[521]6.6426,[522]6.6454,[523]6.6460,[524]6.6527,[525]6.6563,[526]6.6578,[527]6.6601,[528]6.6550,[529]6.6555,[530]6.6496,[531]6.6480,[532]6.6535,[533]6.6557,[534]6.6533,[535]6.6558,[536]6.6507,[537]6.6482,[538]6.6529,[539]6.6540,[540]6.6580,[541]6.6594,[542]6.6601,[543]6.6612,[544]6.6626,[545]6.6603,[546]6.6607,[547]6.6562,[548]6.6503,[549]6.6500,[550]6.6474,[551]6.6433,[552]6.6414,[553]6.6368,[554]6.6342,[555]6.6313,[556]6.6305,[557]6.6333,[558]6.6293,[559]6.6290,[560]6.6284,[561]6.6282,[562]6.6261,[563]6.6263,[564]6.6308,[565]6.6329,[566]6.6326,[567]6.6303,[568]6.6307,[569]6.6286,[570]6.6312,[571]6.6319,[572]6.6329,[573]6.6328,[574]6.6293,[575]6.6292,[576]6.6289,[577]6.6275,[578]6.6255,[579]6.6267,[580]6.6197,[581]6.6156,[582]6.6147,[583]6.6154,[584]6.6157,[585]6.6077,[586]6.6006,[587]6.6007,[588]6.6062,[589]6.6120,[590]6.6153,[591]6.6172,[592]6.6159,[593]6.6125,[594]6.6133,[595]6.6108,[596]6.6154,[597]6.6129,[598]6.6090,[599]6.6113,[600]6.6109,[601]6.6094,[602]6.6117,[603]6.6148,[604]6.6160,[605]6.6196,[606]6.6219,[607]6.6204,[608]6.6163,[609]6.6167,[610]6.6205,[611]6.6184,[612]6.6210,[613]6.6172,[614]6.6119,[615]6.6037,[616]6.6069,[617]6.6005,[618]6.5948,[619]6.5888,[620]6.5735,[621]6.5660,[622]6.5643,[623]6.5659,[624]6.5665,[625]6.5665,[626]6.5651,[627]6.5678,[628]6.5682,[629]6.5679,[630]6.5713,[631]6.5779,[632]6.5834,[633]6.5816,[634]6.5848,[635]6.5856,[636]6.5827,[637]6.5790,[638]6.5820,[639]6.5789,[640]6.5800,[641]6.5805,[642]6.5875,[643]6.5898,[644]6.5908,[645]6.5881,[646]6.5929,[647]6.5893,[648]6.5903,[649]6.5906,[650]6.5948,[651]6.6007,[652]6.6015,[653]6.6062,[654]6.5994,[655]6.5990

6 replies

glinscott Mar 22, 2023
Collaborator

32 threads. let me show the full output, it matches my old output exactly (so maybe per machine delta?). One notable difference is AVX512.

$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512
main: seed = 1679515634
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
24.58 seconds per pass - ETA 4.47 hours
[1]4.5970,[2]5.1807,

glinscott Mar 22, 2023
Collaborator

Huh, I tried with 10 threads, and got exactly your output. So seems like there is something up there.

[1]4.7730,[2]5.3123,[3]6.1950,

Green-Sky Mar 22, 2023
Collaborator Author

oh gad, more dimension space.

One notable difference is AVX512

your log states: AVX512 = 0

glinscott Mar 22, 2023
Collaborator

oh my, i can't read. yes, so AVX512 matches :).

glinscott Mar 23, 2023
Collaborator

Good-ish news. I did a complete 10 thread run of 7B, q4_0 and it resulted in 6.5990. So while there is some impact, it was pretty small over the entire dataset.

BadisG · 2023-03-22T22:46:19Z

BadisG
Mar 22, 2023

Be sure you know exactly what kind of GPTQ quantization you have because the old .pt files don't have the "groupe size 128" if I remember well
https://github.com/qwopqwop200/GPTQ-for-LLaMa

0 replies

blackhole89 · 2023-03-23T01:13:25Z

blackhole89
Mar 23, 2023

I've been running some perplexity tests on my Q4_1 acceleration fork (see also this post), and noticed that the scores for the first few batches were worse. On 7B, it made a difference of as much as 0.1 at some point (usually closer to 0.05). I then swapped around the order of the two acc = _mm256_fmadd_ps(..., ..., acc) additions, and the result got a bit (but not fully) better on some data but still worse on others. (The only other difference is that the constant offset accumulator in my fork is implemented with AVX FMA instructions with different rounding behaviour.)

I suspect that we are dealing with numerical issues here. Indeed, with some debug output, it becomes clear that for some tensors the accumulator hits in excess of 10^4 magnitude while several blocks of summands are 10^{-3}. At float32 precision, the mantissa has about 7 significant decimal digits, so we're clearly actually hitting the spot where arbitrarily many summands could be functionally ignored.

On this hunch, I tried to do a simple stability-improving transformation where I cut the loop computing the dot product in half, summed the first and second halves separately and then finally summed them together. This produced what I think is approximately the best-looking 7B wikitext block scores yet:

[1]4.5225,[2]4.9974,[3]5.8552,[4]6.4904,[5]6.6052

(To compare, Q4_1 in master is cited as [1]4.4880,[2]4.9980,[3]5.9143.)

I think this is a problem we should take seriously, considering that these discrepancies are only an order of magnitude or so off from the perplexity benefits of Q4_1 vs. Q4_0. I don't know if the current split in two is optimal, or we can do better with, say, splitting in four. In fact, how do "professional" and GPU matrix multiplication solutions handle this? Implementing a dot product as a linear loop accumulation seems bound to run into this problem; naively you probably want something closer to a binary tree for reducing your sum.

3 replies

glinscott Mar 23, 2023
Collaborator

https://forums.developer.nvidia.com/t/how-to-improve-float-array-summation-precision-and-stability/67904 has some good details on accurate summation. Kahan summation looks like it might be fairly easy to try.

blackhole89 Mar 23, 2023

Thanks for the pointer. I tried implementing Kahan summation, but the result was actually worse (though I can't rule out that I screwed up something):

[1]4.6048,[2]5.0737,[3]5.9639,[4]6.5621,[5]6.6616,

It also had a ~50% performance overhead, i.e. more than negates the cost of just staying with the implementation in master. I do really wonder how whatever LLaMa was trained on implemented this. It seems quite possible that ultimately the training "worked around" whatever numerical quirks there were (e.g. that small weights towards the tail end of an array tend to get rounded away), and not reproducing them exactly results in worse performance even if it's more correct.

As a sanity check, here's what comes out when doing Kahan sums wrong (flipping the sign on the error compensation terms):

[1]4.6383,[2]5.0922,[3]5.9708,

i.e., barely worse.

Would anyone with better CPU/memory than me be willing to try running the entire eval for the various solutions on 7B Q4_1, to make sure we are not chasing local fluctuations? My estimated runtime for each of them is on the order of 30 hours. I've pushed in the baseline accelerated branch (q4_1_more_accel), one with Kahan summation (q4_1_more_accel_kahan) and one with a tuneable loop splitting parameter (defaults to 2) (q4_1_more_accel_loopsplit).

blackhole89 Mar 23, 2023

...on the other hand, a simple regrouping of terms at the cost of one extra vector operation (namely: calculate acc += (d0 d1 x y + d0 m1 x + m0 d1 y) instead of acc += d0d1xy; acc += d0m1x + m0d1y)) seems to be doing significantly better:

[1]4.3819,[2]4.9055,[3]5.8298,[4]6.4572

We might really need better data than the first few chunks, but it does make sense for this to help (if m0/m1 are negative, then the two terms may cancel to a significant extent).

BadisG · 2023-03-23T10:01:23Z

BadisG
Mar 23, 2023

Guys I find someone who already did the tests https://github.com/IST-DASLab/gptq

0 replies

jasontitus · 2023-03-23T14:20:02Z

jasontitus
Mar 23, 2023

OK, got the numbers for 65B q4_1 - 3.6188.

Full run info (from a M1 Ultra):

system_info: n_threads = 8 / 20 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks
75.52 seconds per pass - ETA 13.74 hours
[1]3.0734,[2]3.4066,[3]4.0188,[4]3.7782,[5]3.6117,[6]3.5366,[7]3.6606,[8]3.7595,[9]3.8370,[10]3.8614,[11]3.8457,[12]3.8366,[13]3.8574, [14]3.9793,[15]4.1297,[16]4.0007,[17]4.0228,[18]4.0730,[19]3.9616,[20]3.9897,[21]3.9688,[22]3.8615,[23]3.8663,[24]3.8370,[25]3.8302,[26]3.7230,[27]3.6268,[28]3.5600,[29]3.4959,[30]3.4014,[31]3.3272,[32]3.3381,[33]3.3071,[34]3.3261,[35]3.3438,[36]3.3694,[37]3.3676,[38]3.3609,[39]3.3768,[40]3.4001,[41]3.4258,[42]3.4519,[43]3.4503,[44]3.4964,[45]3.5201,[46]3.5217,[47]3.5519,[48]3.5601,[49]3.5756,[50]3.5652,[51]3.5865,[52]3.6001,[53]3.6443,[54]3.6531,[55]3.6273,[56]3.5609,[57]3.5173,[58]3.5222,[59]3.5437,[60]3.5681,[61]3.5719,[62]3.6104,[63]3.6249,[64]3.6340,[65]3.6543,[66]3.6499,[67]3.6623,[68]3.6759,[69]3.6967,[70]3.7211,[71]3.7410,[72]3.7699,[73]3.7986,[74]3.8078,[75]3.8150,[76]3.8218,[77]3.8230,[78]3.8082,[79]3.8300,[80]3.8092,[81]3.8230,[82]3.7937,[83]3.7673,[84]3.7580,[85]3.7609,[86]3.7593,[87]3.7198,[88]3.6732,[89]3.6428,[90]3.6205,[91]3.6054,[92]3.5932,[93]3.5809,[94]3.5684,[95]3.5753,[96]3.5711,[97]3.5593,[98]3.5547,[99]3.5462,[100]3.5449,[101]3.5445,[102]3.5304,[103]3.5271,[104]3.5242,[105]3.5195,[106]3.5159,[107]3.5084,[108]3.5114,[109]3.5053,[110]3.4971,[111]3.5063,[112]3.5138,[113]3.5072,[114]3.4963,[115]3.4896,[116]3.4863,[117]3.4829,[118]3.5051,[119]3.5251,[120]3.5520,[121]3.5640,[122]3.5850,[123]3.6153,[124]3.6331,[125]3.6383,[126]3.6693,[127]3.6974,[128]3.7208,[129]3.7058,[130]3.7156,[131]3.7211,[132]3.7236,[133]3.7205,[134]3.7313,[135]3.7397,[136]3.7420,[137]3.7480,[138]3.7448,[139]3.7470,[140]3.7496,[141]3.7372,[142]3.7409,[143]3.7307,[144]3.7262,[145]3.7228,[146]3.7220,[147]3.7309,[148]3.7377,[149]3.7399,[150]3.7453,[151]3.7541,[152]3.7571,[153]3.7551,[154]3.7577,[155]3.7669,[156]3.7709,[157]3.7849,[158]3.7883,[159]3.7935,[160]3.8042,[161]3.8162,[162]3.8054,[163]3.8033,[164]3.7938,[165]3.7825,[166]3.7712,[167]3.7563,[168]3.7350,[169]3.7226,[170]3.7201,[171]3.7089,[172]3.7050,[173]3.6996,[174]3.6858,[175]3.6746,[176]3.6706,[177]3.6635,[178]3.6545,[179]3.6501,[180]3.6492,[181]3.6417,[182]3.6339,[183]3.6309,[184]3.6345,[185]3.6287,[186]3.6291,[187]3.6377,[188]3.6327,[189]3.6445,[190]3.6479,[191]3.6646,[192]3.6760,[193]3.6870,[194]3.6941,[195]3.7111,[196]3.7236,[197]3.7408,[198]3.7529,[199]3.7588,[200]3.7403,[201]3.7223,[202]3.7133,[203]3.7032,[204]3.6876,[205]3.6831,[206]3.6752,[207]3.6701,[208]3.6693,[209]3.6716,[210]3.6802,[211]3.6912,[212]3.6982,[213]3.7098,[214]3.7168,[215]3.7217,[216]3.7336,[217]3.7475,[218]3.7599,[219]3.7642,[220]3.7667,[221]3.7662,[222]3.7694,[223]3.7682,[224]3.7667,[225]3.7671,[226]3.7811,[227]3.7845,[228]3.7932,[229]3.8007,[230]3.8013,[231]3.8141,[232]3.8114,[233]3.8067,[234]3.8005,[235]3.7877,[236]3.7859,[237]3.7828,[238]3.7886,[239]3.7854,[240]3.7834,[241]3.7863,[242]3.7909,[243]3.7940,[244]3.7915,[245]3.7917,[246]3.7882,[247]3.7854,[248]3.7849,[249]3.7858,[250]3.7920,[251]3.7899,[252]3.7911,[253]3.7865,[254]3.7845,[255]3.7822,[256]3.7745,[257]3.7702,[258]3.7675,[259]3.7688,[260]3.7648,[261]3.7634,[262]3.7628,[263]3.7615,[264]3.7471,[265]3.7494,[266]3.7485,[267]3.7463,[268]3.7508,[269]3.7510,[270]3.7544,[271]3.7613,[272]3.7628,[273]3.7636,[274]3.7666,[275]3.7721,[276]3.7753,[277]3.7860,[278]3.7942,[279]3.8019,[280]3.8057,[281]3.8135,[282]3.8202,[283]3.8316,[284]3.8405,[285]3.8500,[286]3.8606,[287]3.8568,[288]3.8638,[289]3.8621,[290]3.8486,[291]3.8349,[292]3.8225,[293]3.8120,[294]3.8069,[295]3.8087,[296]3.8089,[297]3.8077,[298]3.8054,[299]3.7996,[300]3.7903,[301]3.7826,[302]3.7758,[303]3.7697,[304]3.7631,[305]3.7586,[306]3.7517,[307]3.7482,[308]3.7454,[309]3.7371,[310]3.7316,[311]3.7275,[312]3.7238,[313]3.7206,[314]3.7178,[315]3.7086,[316]3.7034,[317]3.6957,[318]3.6848,[319]3.6928,[320]3.7023,[321]3.7072,[322]3.7071,[323]3.7028,[324]3.7025,[325]3.7081,[326]3.7116,[327]3.7131,[328]3.7159,[329]3.7194,[330]3.7203,[331]3.7286,[332]3.7280,[333]3.7342,[334]3.7312,[335]3.7306,[336]3.7340,[337]3.7357,[338]3.7367,[339]3.7363,[340]3.7364,[341]3.7425,[342]3.7471,[343]3.7518,[344]3.7537,[345]3.7571,[346]3.7587,[347]3.7626,[348]3.7677,[349]3.7722,[350]3.7727,[351]3.7749,[352]3.7766,[353]3.7753,[354]3.7755,[355]3.7777,[356]3.7816,[357]3.7805,[358]3.7886,[359]3.7881,[360]3.7873,[361]3.7884,[362]3.7935,[363]3.8012,[364]3.8054,[365]3.8091,[366]3.8115,[367]3.8189,[368]3.8196,[369]3.8218,[370]3.8254,[371]3.8245,[372]3.8301,[373]3.8340,[374]3.8349,[375]3.8358,[376]3.8414,[377]3.8399,[378]3.8427,[379]3.8461,[380]3.8426,[381]3.8412,[382]3.8383,[383]3.8381,[384]3.8400,[385]3.8396,[386]3.8396,[387]3.8409,[388]3.8385,[389]3.8361,[390]3.8331,[391]3.8299,[392]3.8279,[393]3.8266,[394]3.8302,[395]3.8308,[396]3.8288,[397]3.8343,[398]3.8384,[399]3.8441,[400]3.8440,[401]3.8467,[402]3.8481,[403]3.8517,[404]3.8578,[405]3.8463,[406]3.8358,[407]3.8293,[408]3.8274,[409]3.8334,[410]3.8400,[411]3.8460,[412]3.8548,[413]3.8604,[414]3.8633,[415]3.8676,[416]3.8718,[417]3.8779,[418]3.8752,[419]3.8763,[420]3.8783,[421]3.8845,[422]3.8852,[423]3.8868,[424]3.8902,[425]3.8939,[426]3.8977,[427]3.8999,[428]3.9026,[429]3.9055,[430]3.9096,[431]3.9189,[432]3.9209,[433]3.9205,[434]3.9192,[435]3.9203,[436]3.9199,[437]3.9260,[438]3.9313,[439]3.9294,[440]3.9282,[441]3.9253,[442]3.9253,[443]3.9277,[444]3.9302,[445]3.9307,[446]3.9336,[447]3.9364,[448]3.9389,[449]3.9385,[450]3.9404,[451]3.9402,[452]3.9302,[453]3.9217,[454]3.9148,[455]3.9093,[456]3.9092,[457]3.9029,[458]3.8961,[459]3.8908,[460]3.8831,[461]3.8749,[462]3.8650,[463]3.8582,[464]3.8535,[465]3.8493,[466]3.8408,[467]3.8310,[468]3.8229,[469]3.8141,[470]3.8061,[471]3.7970,[472]3.7887,[473]3.7815,[474]3.7748,[475]3.7671,[476]3.7600,[477]3.7554,[478]3.7537,[479]3.7495,[480]3.7428,[481]3.7371,[482]3.7313,[483]3.7240,[484]3.7171,[485]3.7131,[486]3.7072,[487]3.7031,[488]3.7014,[489]3.6937,[490]3.6864,[491]3.6811,[492]3.6734,[493]3.6671,[494]3.6623,[495]3.6569,[496]3.6523,[497]3.6445,[498]3.6369,[499]3.6335,[500]3.6281,[501]3.6202,[502]3.6153,[503]3.6099,[504]3.6019,[505]3.5971,[506]3.5984,[507]3.5955,[508]3.5947,[509]3.5969,[510]3.5998,[511]3.6043,[512]3.6091,[513]3.6123,[514]3.6174,[515]3.6124,[516]3.6118,[517]3.6115,[518]3.6103,[519]3.6123,[520]3.6127,[521]3.6127,[522]3.6137,[523]3.6138,[524]3.6186,[525]3.6203,[526]3.6208,[527]3.6219,[528]3.6193,[529]3.6216,[530]3.6208,[531]3.6218,[532]3.6269,[533]3.6305,[534]3.6296,[535]3.6309,[536]3.6272,[537]3.6268,[538]3.6286,[539]3.6301,[540]3.6326,[541]3.6327,[542]3.6338,[543]3.6360,[544]3.6386,[545]3.6394,[546]3.6410,[547]3.6407,[548]3.6391,[549]3.6403,[550]3.6396,[551]3.6390,[552]3.6396,[553]3.6396,[554]3.6394,[555]3.6397,[556]3.6406,[557]3.6422,[558]3.6415,[559]3.6429,[560]3.6390,[561]3.6403,[562]3.6398,[563]3.6402,[564]3.6446,[565]3.6468,[566]3.6486,[567]3.6478,[568]3.6500,[569]3.6507,[570]3.6534,[571]3.6550,[572]3.6568,[573]3.6586,[574]3.6577,[575]3.6578,[576]3.6586,[577]3.6582,[578]3.6583,[579]3.6591,[580]3.6567,[581]3.6559,[582]3.6574,[583]3.6597,[584]3.6623,[585]3.6596,[586]3.6575,[587]3.6592,[588]3.6640,[589]3.6685,[590]3.6712,[591]3.6740,[592]3.6743,[593]3.6680,[594]3.6688,[595]3.6642,[596]3.6673,[597]3.6669,[598]3.6654,[599]3.6676,[600]3.6682,[601]3.6649,[602]3.6628,[603]3.6654,[604]3.6651,[605]3.6675,[606]3.6695,[607]3.6698,[608]3.6691,[609]3.6705,[610]3.6736,[611]3.6732,[612]3.6759,[613]3.6747,[614]3.6730,[615]3.6709,[616]3.6732,[617]3.6712,[618]3.6696,[619]3.6683,[620]3.6623,[621]3.6594,[622]3.6556,[623]3.6563,[624]3.6573,[625]3.6587,[626]3.6597,[627]3.6621,[628]3.6633,[629]3.6642,[630]3.6665,[631]3.6688,[632]3.6727,[633]3.6725,[634]3.6750,[635]3.6753,[636]3.6734,[637]3.6676,[638]3.6626,[639]3.6568,[640]3.6505,[641]3.6452,[642]3.6411,[643]3.6352,[644]3.6311,[645]3.6264,[646]3.6248,[647]3.6198,[648]3.6161,[649]3.6155,[650]3.6166,[651]3.6193,[652]3.6193,[653]3.6224,[654]3.6195,[655]3.6188,
./main -t 8 --perplexity -m models/65B/ggml-model-q4_1.bin -f wiki.test.raw 408886.50s user 2150.25s system 894% cpu 12:46:10.65 total

0 replies

neurostar · 2023-03-23T14:36:16Z

neurostar
Mar 23, 2023

Perplexity score for 13B f16 - 5.2455

13B f16 raw data

[1]3.6920,[2]4.1502,[3]4.9227,[4]5.3138,[5]5.4988,[6]5.4418,[7]5.5892,[8]5.7035,[9]5.9589,[10]6.1779,[11]6.3594,[12]6.4056,[13]6.3646,[14]6.4525,[15]6.6488,[16]6.3378,[17]6.2593,[18]6.2369,[19]5.9537,[20]5.9339,[21]5.8613,[22]5.6905,[23]5.6637,[24]5.5727,[25]5.5836,[26]5.4377,[27]5.2660,[28]5.1678,[29]5.0918,[30]4.9584,[31]4.9168,[32]4.9304,[33]4.8871,[34]4.9276,[35]4.9463,[36]4.9698,[37]4.9619,[38]4.9593,[39]4.9869,[40]5.0272,[41]5.0495,[42]5.0830,[43]5.0490,[44]5.0919,[45]5.0944,[46]5.0694,[47]5.0982,[48]5.0816,[49]5.0833,[50]5.0530,[51]5.0610,[52]5.0542,[53]5.0999,[54]5.0905,[55]5.0723,[56]5.0915,[57]5.1093,[58]5.1312,[59]5.1492,[60]5.1848,[61]5.1786,[62]5.2323,[63]5.2568,[64]5.2683,[65]5.3041,[66]5.3034,[67]5.3211,[68]5.3333,[69]5.3604,[70]5.3899,[71]5.4117,[72]5.4453,[73]5.4921,[74]5.4992,[75]5.5082,[76]5.5223,[77]5.5336,[78]5.5201,[79]5.5462,[80]5.5412,[81]5.5490,[82]5.5459,[83]5.5012,[84]5.4898,[85]5.4834,[86]5.4685,[87]5.4029,[88]5.3581,[89]5.3366,[90]5.3268,[91]5.3474,[92]5.3433,[93]5.3451,[94]5.3450,[95]5.3714,[96]5.3681,[97]5.3649,[98]5.3611,[99]5.3541,[100]5.3514,[101]5.3747,[102]5.3704,[103]5.3863,[104]5.3905,[105]5.3922,[106]5.4062,[107]5.4049,[108]5.4198,[109]5.4190,[110]5.4137,[111]5.4315,[112]5.4479,[113]5.4473,[114]5.4460,[115]5.4502,[116]5.4385,[117]5.4379,[118]5.4618,[119]5.4799,[120]5.5083,[121]5.5233,[122]5.5451,[123]5.5813,[124]5.5987,[125]5.5938,[126]5.6289,[127]5.6610,[128]5.6888,[129]5.6772,[130]5.6855,[131]5.6816,[132]5.6779,[133]5.6658,[134]5.6742,[135]5.6741,[136]5.6658,[137]5.6622,[138]5.6486,[139]5.6409,[140]5.6399,[141]5.6127,[142]5.6087,[143]5.5837,[144]5.5680,[145]5.5591,[146]5.5482,[147]5.5533,[148]5.5563,[149]5.5531,[150]5.5525,[151]5.5572,[152]5.5516,[153]5.5421,[154]5.5365,[155]5.5429,[156]5.5409,[157]5.5565,[158]5.5581,[159]5.5589,[160]5.5626,[161]5.5734,[162]5.5487,[163]5.5393,[164]5.5191,[165]5.4943,[166]5.4717,[167]5.4404,[168]5.4139,[169]5.4008,[170]5.3919,[171]5.3718,[172]5.3599,[173]5.3475,[174]5.3211,[175]5.3012,[176]5.2880,[177]5.2717,[178]5.2520,[179]5.2393,[180]5.2323,[181]5.2162,[182]5.2000,[183]5.1881,[184]5.1872,[185]5.1803,[186]5.1812,[187]5.1867,[188]5.1842,[189]5.2003,[190]5.2006,[191]5.2174,[192]5.2312,[193]5.2457,[194]5.2567,[195]5.2754,[196]5.2871,[197]5.3058,[198]5.3190,[199]5.3209,[200]5.3213,[201]5.3145,[202]5.3270,[203]5.3325,[204]5.3275,[205]5.3359,[206]5.3410,[207]5.3371,[208]5.3427,[209]5.3461,[210]5.3518,[211]5.3621,[212]5.3684,[213]5.3776,[214]5.3803,[215]5.3835,[216]5.3955,[217]5.4120,[218]5.4255,[219]5.4255,[220]5.4229,[221]5.4184,[222]5.4185,[223]5.4122,[224]5.4057,[225]5.4022,[226]5.4219,[227]5.4267,[228]5.4340,[229]5.4410,[230]5.4371,[231]5.4522,[232]5.4419,[233]5.4273,[234]5.4128,[235]5.3905,[236]5.3855,[237]5.3770,[238]5.3801,[239]5.3691,[240]5.3601,[241]5.3633,[242]5.3648,[243]5.3641,[244]5.3543,[245]5.3508,[246]5.3410,[247]5.3314,[248]5.3254,[249]5.3221,[250]5.3257,[251]5.3176,[252]5.3128,[253]5.3038,[254]5.2995,[255]5.2906,[256]5.2744,[257]5.2647,[258]5.2581,[259]5.2572,[260]5.2490,[261]5.2439,[262]5.2399,[263]5.2351,[264]5.2117,[265]5.2117,[266]5.2090,[267]5.2029,[268]5.2092,[269]5.2085,[270]5.2094,[271]5.2155,[272]5.2184,[273]5.2198,[274]5.2206,[275]5.2265,[276]5.2322,[277]5.2442,[278]5.2524,[279]5.2606,[280]5.2644,[281]5.2739,[282]5.2792,[283]5.2916,[284]5.3002,[285]5.3081,[286]5.3204,[287]5.3170,[288]5.3223,[289]5.3163,[290]5.3023,[291]5.2894,[292]5.2763,[293]5.2645,[294]5.2652,[295]5.2654,[296]5.2699,[297]5.2690,[298]5.2710,[299]5.2688,[300]5.2602,[301]5.2605,[302]5.2543,[303]5.2461,[304]5.2389,[305]5.2364,[306]5.2260,[307]5.2290,[308]5.2298,[309]5.2168,[310]5.2141,[311]5.2098,[312]5.2113,[313]5.2059,[314]5.2043,[315]5.1917,[316]5.1873,[317]5.1749,[318]5.1589,[319]5.1692,[320]5.1801,[321]5.1847,[322]5.1817,[323]5.1760,[324]5.1741,[325]5.1833,[326]5.1850,[327]5.1856,[328]5.1890,[329]5.1937,[330]5.1960,[331]5.2063,[332]5.2029,[333]5.2105,[334]5.2061,[335]5.2013,[336]5.2036,[337]5.2027,[338]5.2023,[339]5.1981,[340]5.1954,[341]5.2020,[342]5.2052,[343]5.2093,[344]5.2097,[345]5.2112,[346]5.2097,[347]5.2133,[348]5.2170,[349]5.2191,[350]5.2173,[351]5.2186,[352]5.2187,[353]5.2137,[354]5.2144,[355]5.2192,[356]5.2222,[357]5.2193,[358]5.2272,[359]5.2293,[360]5.2260,[361]5.2258,[362]5.2326,[363]5.2434,[364]5.2485,[365]5.2523,[366]5.2542,[367]5.2628,[368]5.2608,[369]5.2622,[370]5.2642,[371]5.2604,[372]5.2651,[373]5.2692,[374]5.2674,[375]5.2671,[376]5.2728,[377]5.2695,[378]5.2721,[379]5.2759,[380]5.2692,[381]5.2662,[382]5.2625,[383]5.2607,[384]5.2607,[385]5.2595,[386]5.2583,[387]5.2581,[388]5.2551,[389]5.2516,[390]5.2464,[391]5.2409,[392]5.2374,[393]5.2370,[394]5.2402,[395]5.2395,[396]5.2344,[397]5.2409,[398]5.2452,[399]5.2521,[400]5.2514,[401]5.2521,[402]5.2531,[403]5.2555,[404]5.2610,[405]5.2458,[406]5.2415,[407]5.2404,[408]5.2414,[409]5.2524,[410]5.2614,[411]5.2707,[412]5.2846,[413]5.2947,[414]5.3007,[415]5.3066,[416]5.3137,[417]5.3231,[418]5.3254,[419]5.3301,[420]5.3378,[421]5.3475,[422]5.3509,[423]5.3564,[424]5.3652,[425]5.3726,[426]5.3786,[427]5.3826,[428]5.3897,[429]5.3933,[430]5.3994,[431]5.4119,[432]5.4150,[433]5.4143,[434]5.4111,[435]5.4124,[436]5.4153,[437]5.4234,[438]5.4307,[439]5.4280,[440]5.4275,[441]5.4232,[442]5.4221,[443]5.4231,[444]5.4248,[445]5.4241,[446]5.4261,[447]5.4284,[448]5.4315,[449]5.4300,[450]5.4311,[451]5.4283,[452]5.4127,[453]5.4031,[454]5.3975,[455]5.3978,[456]5.4018,[457]5.4029,[458]5.4012,[459]5.4008,[460]5.4080,[461]5.4037,[462]5.4000,[463]5.3977,[464]5.3974,[465]5.3952,[466]5.3877,[467]5.3863,[468]5.3841,[469]5.3851,[470]5.3839,[471]5.3789,[472]5.3792,[473]5.3746,[474]5.3732,[475]5.3664,[476]5.3637,[477]5.3551,[478]5.3522,[479]5.3521,[480]5.3541,[481]5.3541,[482]5.3495,[483]5.3454,[484]5.3461,[485]5.3392,[486]5.3327,[487]5.3315,[488]5.3292,[489]5.3238,[490]5.3206,[491]5.3172,[492]5.3105,[493]5.3076,[494]5.3058,[495]5.3035,[496]5.2997,[497]5.2934,[498]5.2907,[499]5.2871,[500]5.2793,[501]5.2722,[502]5.2710,[503]5.2700,[504]5.2624,[505]5.2621,[506]5.2627,[507]5.2573,[508]5.2538,[509]5.2544,[510]5.2565,[511]5.2607,[512]5.2647,[513]5.2672,[514]5.2725,[515]5.2687,[516]5.2678,[517]5.2679,[518]5.2679,[519]5.2701,[520]5.2714,[521]5.2725,[522]5.2739,[523]5.2745,[524]5.2799,[525]5.2827,[526]5.2831,[527]5.2847,[528]5.2795,[529]5.2803,[530]5.2767,[531]5.2764,[532]5.2811,[533]5.2837,[534]5.2817,[535]5.2838,[536]5.2796,[537]5.2778,[538]5.2828,[539]5.2835,[540]5.2850,[541]5.2846,[542]5.2860,[543]5.2881,[544]5.2894,[545]5.2884,[546]5.2888,[547]5.2856,[548]5.2815,[549]5.2815,[550]5.2795,[551]5.2770,[552]5.2751,[553]5.2722,[554]5.2700,[555]5.2681,[556]5.2673,[557]5.2689,[558]5.2656,[559]5.2659,[560]5.2645,[561]5.2647,[562]5.2623,[563]5.2621,[564]5.2663,[565]5.2674,[566]5.2681,[567]5.2662,[568]5.2672,[569]5.2658,[570]5.2684,[571]5.2696,[572]5.2705,[573]5.2710,[574]5.2680,[575]5.2663,[576]5.2656,[577]5.2640,[578]5.2621,[579]5.2620,[580]5.2569,[581]5.2542,[582]5.2542,[583]5.2551,[584]5.2558,[585]5.2499,[586]5.2447,[587]5.2450,[588]5.2493,[589]5.2541,[590]5.2572,[591]5.2589,[592]5.2579,[593]5.2540,[594]5.2552,[595]5.2536,[596]5.2576,[597]5.2557,[598]5.2525,[599]5.2551,[600]5.2542,[601]5.2531,[602]5.2530,[603]5.2556,[604]5.2562,[605]5.2588,[606]5.2602,[607]5.2587,[608]5.2559,[609]5.2568,[610]5.2608,[611]5.2596,[612]5.2618,[613]5.2591,[614]5.2552,[615]5.2495,[616]5.2520,[617]5.2471,[618]5.2429,[619]5.2386,[620]5.2280,[621]5.2231,[622]5.2213,[623]5.2226,[624]5.2231,[625]5.2239,[626]5.2236,[627]5.2262,[628]5.2271,[629]5.2275,[630]5.2305,[631]5.2348,[632]5.2394,[633]5.2383,[634]5.2413,[635]5.2410,[636]5.2375,[637]5.2337,[638]5.2356,[639]5.2326,[640]5.2332,[641]5.2336,[642]5.2384,[643]5.2401,[644]5.2419,[645]5.2406,[646]5.2440,[647]5.2388,[648]5.2399,[649]5.2402,[650]5.2431,[651]5.2473,[652]5.2478,[653]5.2515,[654]5.2462,[655]5.2455,

0 replies

gjmulder · 2023-03-23T15:31:57Z

gjmulder
Mar 23, 2023
Collaborator

5 replies

blackhole89 Mar 23, 2023

Here's another potentially useful plot, showing how well we can('t) use early perplexity values to estimate the final one, at least across different models.

The yellow and green lines are the running ratios of (perplexity up to chunk n / perplexity on all 655 chunks) for the 65b q4_1 and the 13b f16 series that were posted above. The purple line is the ratio of those two ratios. If early measurements were good predictors of the final one (i.e. we could do something like taking the P_(model1)[3]/P_(model1)[655] ratio for one model, divide the P_(model2)[3] perplexity for another model by it, and thereby estimate the final perplexity P_(model2)[655]), we'd expect this to be a line at 1, or at least something that's easily modelled. Instead, we see it isn't: different-size models are good at different parts of Wikitext.

Remains to see if this exercise works any better for different quantizations of the same model. If it did, we could at least evaluate different quantizations and numeric fiddling much faster...

Edit: Same thing, but for 13b q4_0 vs. 13b f16:

This suggests that for different quantizations of the same model, this may be more viable. Formally, define the nth P1-estimate of P2's final perplexity as being P2[n]*P1[655]/P1[n]. Then this seems to become a reasonable estimate around n=30, suggesting that we can pick the quantization and calculation strategy that performs the best at the first 30 chunks of Wikitext and be somewhat confident that it is in fact (not too far from) the best.

gjmulder Mar 23, 2023
Collaborator

I'm about to go on holiday for 3+ weeks next week, so if we can design and smoke test a test suite that I can run with minimal monitoring we have 8000+ CPU hours of free compute available to churn away against 1000+ chunks per model or whatever people think is useful 🏖️ 📉

blackhole89 Mar 24, 2023

I ran both master and the q4_1_more_accel branch + regrouped additions to 30 steps, and it looks like they converge to basically the same. If the estimation I outlined above holds up, maybe it's not worth to even bother with the regrouping, and we should just merge q4_1_more_accel as it is currently upstreamed (i.e., the fastest code under consideration).

master

[1]4.5565,[2]5.0597,[3]5.9601,[4]6.5526,[5]6.6511,[6]6.6109,[7]6.7965,[8]6.9006,[9]7.2246,[10]7.4742,[11]7.7228,[12]7.7703,[13]7.6961,[14]7.7716,[15]8.0268,[16]7.6081,[17]7.4710,[18]7.4267,[19]7.0473,[20]7.0268,[21]6.9285,[22]6.7557,[23]6.7290,[24]6.6439,[25]6.6503,[26]6.4792,[27]6.2956,[28]6.1830,[29]6.0938,[30]5.9276

q4_1_more_accel + regroup

[1]4.3819,[2]4.9055,[3]5.8298,[4]6.4572,[5]6.5991,[6]6.5767,[7]6.7875,[8]6.8693,[9]7.2119,[10]7.4796,[11]7.7147,[12]7.7601,[13]7.6988,[14]7.7769,[15]8.0329,[16]7.6225,[17]7.4931,[18]7.4461,[19]7.0621,[20]7.0353,[21]6.9411,[22]6.7664,[23]6.7310,[24]6.6350,[25]6.6413,[26]6.4746,[27]6.2868,[28]6.1829,[29]6.0933,[30]5.9268

jasontitus Mar 24, 2023

My M1 ultra run of 64B q4_0 is done - 3.6902 - 13 hours

64B_q4_0.txt

Green-Sky Mar 24, 2023
Collaborator Author

in the plot you can clearly see that 30 and 65 where trained on more tokens.

jasontitus · 2023-03-24T07:15:18Z

jasontitus
Mar 24, 2023

I just started a run on a 65B gptq model but it looked noticeably worse (something like 3.4 vs 3.0 on the first couple iterations) and seemed like it wouldn't be worth running the full set on. Is the the gptq inference code solidified yet? If so, is there anyway to generate a gptq output w/o an Nvidia GPU (I'm not sure how up to date the quantization was on the model I have and I'd like to make sure it is a current version).

1 reply

BadisG Mar 24, 2023

@jasontitus you should test on a GPTQ .pt file that has a group size of 128 to get the expected best results:
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-64b c4 --wbits 4 --groupsize 128 --save llama64b-4bit-128g.pt

The problem is that we don't know how to convert those files into ggml yet
#442 (comment)

gjmulder · 2023-03-24T08:44:31Z

gjmulder
Mar 24, 2023
Collaborator

Out of curiosity I'm doing a 3 day run with the 5521 total chunks in wiki.train.raw. Progress will be updated here:

5 replies

Green-Sky Mar 24, 2023
Collaborator Author

cool, but which file is that exactly ? because the wiki.test.raw should only have 655

Green-Sky Mar 24, 2023
Collaborator Author

or did you change context size perhaps? #407 also changes how that work.

gjmulder Mar 24, 2023
Collaborator

cool, but which file is that exactly ? because the wiki.test.raw should only have 655

wiki.train.raw, sorry. At the time the plot title was correct, but not my post description. 🤦

Green-Sky Mar 24, 2023
Collaborator Author

wow, that graph is very unstable, even past batch 1000

gjmulder Mar 26, 2023
Collaborator

@Green-Sky:

Call:
lm(formula = Perplexity ~ Chunk, data = filter(perps, Run == "7B_q4_0_655chunks_test"))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27251 -0.11085  0.02698  0.06537  0.41289 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.734e+00  1.422e-02 473.452  < 2e-16 ***
Chunk       -2.613e-04  3.464e-05  -7.543  1.9e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1308 on 553 degrees of freedom
Multiple R-squared:  0.09329,	Adjusted R-squared:  0.09165 
F-statistic:  56.9 on 1 and 553 DF,  p-value: 1.898e-13

Call:
lm(formula = Perplexity ~ Chunk, data = filter(perps, Run == "7B_q4_0_5521chunks_train"))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.199060 -0.035141  0.001171  0.036785  0.303817 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.915e+00  1.827e-03 3785.03   <2e-16 ***
Chunk       -1.969e-05  6.120e-07  -32.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06282 on 5017 degrees of freedom
Multiple R-squared:  0.171,	Adjusted R-squared:  0.1709 
F-statistic:  1035 on 1 and 5017 DF,  p-value: < 2.2e-16

jasontitus · 2023-03-24T13:59:45Z

jasontitus
Mar 24, 2023

I also just ran a 30B q4_1 run last night. It finished at 4.2701.

30B-q4_1-perp.txt

17 replies

Green-Sky Mar 24, 2023
Collaborator Author

I don't have enough disk space for the bigger models

i am sooo feeling that.

ping me if you want me to test 30B, i cant run the 65B though.

Green-Sky Mar 24, 2023
Collaborator Author

@glinscott looks like we are switching on --memory_f16 by default 9330ff0#diff-e5f52015db7fc454918552b75cd88ce194845d744c17d78eecde2e7777d27076R82

might skew perplexity a bit.

ggerganov Mar 24, 2023
Maintainer

Should I keep F32 for now?
I thought we have seen that there is almost no difference between the 2 modes

glinscott Mar 24, 2023
Collaborator

it was a pretty tiny delta from the measurements I did, so makes sense to switch on by default IMO

Green-Sky Mar 24, 2023
Collaborator Author

yes F16 is good, but we did not extensively test. #270

observed changes where only 0.00x

jasontitus · 2023-03-24T16:06:06Z

jasontitus
Mar 24, 2023

I can run the 65B models again, but it doesn’t sound like anything should change much since 2 days ago. I thought the changes were just to give a little more stability to the runs and that the absolute scores shouldn’t change much. A single run on 65B takes about 13 or 14 hours so I’m not too eager to redo them unless needed.

…

On Fri, Mar 24, 2023 at 9:01 AM Erik Scholz ***@***.***> wrote: I don't have enough disk space for the bigger models i am sooo feeling that. ping me if you want me to test 30B, i cant run the 65B though. — Reply to this email directly, view it on GitHub <#406 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGUOXWCABABRKGVUHTUVQDW5XAPFANCNFSM6AAAAAAWEHG6B4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

jasontitus · 2023-03-24T21:04:06Z

jasontitus
Mar 24, 2023

OK, finished a run of 30B f16 (non-quantized - not with the --memory_f16 option) that can compare to the q4_1 - 4.1539.
30B-f16-perp.txt

3 replies

Green-Sky Mar 24, 2023
Collaborator Author

wow, with q4_1 being 4.2701, there is only 0.1162 difference.

gjmulder Mar 24, 2023
Collaborator

OK, finished a run of 30B f16 (non-quantized - not with the --memory_f16 option) that can compare to the q4_1 - 4.1539. 30B-f16-perp.txt

Updated plot at #406 (comment)

j-f1 Mar 24, 2023
Collaborator

@gjmulder What do you think about moving the chart to the OP so it can be more easily referenced?

ggerganov · 2023-03-25T15:06:25Z

ggerganov
Mar 25, 2023
Maintainer

FYI with the latest BLAS fixes, I believe that perplexity computations should be much faster with big batch sizes ( > 255) if you link against OpenBLAS:

# on x86
make clean && LLAMA_OPENBLAS=1 make -j

Let me know if this is true.
Here are quick tests on my M1 without and with BLAS:

# no BLAS, 7B
make clean && LLAMA_NO_ACCELERATE=1 make -j && ./main --perplexity -m ./models/7B/ggml-model-q4_0.bin -f build/wiki.test.raw -t 8
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
21.11 seconds per pass - ETA 3.84 hours

# no BLAS, 13B
make clean && LLAMA_NO_ACCELERATE=1 make -j && ./main --perplexity -m ./models/13B/ggml-model-q4_0.bin -f build/wiki.test.raw -t 8
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
41.39 seconds per pass - ETA 7.53 hours

# with BLAS, 7B
make clean && make -j && ./main --perplexity -m ./models/7B/ggml-model-q4_0.bin -f build/wiki.test.raw -t 8
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
10.43 seconds per pass - ETA 1.90 hours

# with BLAS, 13B
make clean && make -j && ./main --perplexity -m ./models/13B/ggml-model-q4_0.bin -f build/wiki.test.raw -t 8
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks
19.34 seconds per pass - ETA 3.52 hours

So about x2 speed-up when using BLAS.

0 replies

jasontitus · 2023-03-25T15:16:05Z

jasontitus
Mar 25, 2023

Does that mean it will also get faster on Apple Silicon now?

…

On Sat, Mar 25, 2023 at 8:06 AM Georgi Gerganov ***@***.***> wrote: FYI with the latest BLAS fixes, I believe that perplexity computations should be much faster with big batch sizes ( > 255) if you link against OpenBLAS — Reply to this email directly, view it on GitHub <#406 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGUOXSCZGM2LDIDZLYCADLW54CXXANCNFSM6AAAAAAWEHG6B4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

2 replies

gjmulder Mar 26, 2023
Collaborator

Does that mean it will also get faster on Apple Silicon now?

I don't have a Mac so YMMV, but https://github.com/JuliaLang/julia/issues/42312 might be relevant to seeing whether you can link against Apple's proprietary BLAS instead of OpenBLAS.

ggerganov Mar 26, 2023
Maintainer

@jasontitus
On Mac we link to Apple's built-in BLAS by default, so it will be faster without having to do anything extra

gjmulder · 2023-03-26T07:24:06Z

gjmulder
Mar 26, 2023
Collaborator

Test plan for my 16 core AMD Threadripper 1950X for the new few weeks while I'm away:

My 3 day pre-BLAS run with 7B_q4_0 against wiki.train.raw above is indicating wiki.test.raw is giving us somewhat optimistic perplexity scores. Relative comparison using wiki.test.raw is hower likely valid.

What do people think is the most interesting to explore and / or technically feasible with 128GB of RAM?

e.g.

./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 00512 -c 00512
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 01024 -c 01024
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 02048 -c 02048
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 04096 -c 04096
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 08192 -c 08192
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.train.raw -b 16384 -c 16384

or maybe?

./main -t 16 --perplexity -m models/llama-7B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 1024 -c 1024
./main -t 16 --perplexity -m models/llama-7B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 2048 -c 2048
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 1024 -c 1024
./main -t 16 --perplexity -m models/llama-13B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 2048 -c 2048
./main -t 16 --perplexity -m models/llama-30B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 1024 -c 1024
./main -t 16 --perplexity -m models/llama-30B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 2048 -c 2048
./main -t 16 --perplexity -m models/llama-65B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 1024 -c 1024
./main -t 16 --perplexity -m models/llama-65B-ggml/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw -b 2048 -c 2048

11 replies

gjmulder Mar 26, 2023
Collaborator

Should I be independently grid searching batch size and context size?

gjmulder Mar 28, 2023
Collaborator

Below is an exploration for context_size == batch_size for values 512, 768, 1024, and 1280. It segfaulted for context sizes of 1536, 1792, and 2048. Note that since larger context sizes produce larger chunks, the x-axis is now scaled as a percentage of chunks of the total chunks for the run.

I agree with @BadisG's $0.02 in #129 (comment), except there's some curious behaviour for context_size == batch_size == 1280.

gjmulder Mar 30, 2023
Collaborator

I can confirm that different batch sizes are numerically stable 🤣

Also, at least on my 16 core AMD hardware and using OpenBLAS, 512 and 1024 batch sizes are near identical in compute and memory I/O performance, except for the number of context switches. I wonder if there would be any differences with

OPENBLAS_NUM_THREADS=1

or

OPENBLAS_NUM_THREADS=2

given -t 16 is using an average of about 18 hypercores?

perplexity$ grep -A 17 Performance 13B_q4_0_btch
 Performance counter stats for './perplexity -t 16 --perplexity -m /data/llama/llama-13B-ggml/ggml-model-q4_0.bin -f /data/llama/wikitext-2-raw/wiki.test.raw -b 512 -c 1024':

      662183047.63 msec task-clock                #   17.815 CPUs utilized          
          24065567      context-switches          #   36.343 /sec                   
             17479      cpu-migrations            #    0.026 /sec                   
           2992555      page-faults               #    4.519 /sec                   
  2349030554006549      cycles                    #    3.547 GHz                      (16.67%)
    22138475576781      stalled-cycles-frontend   #    0.94% frontend cycles idle     (16.67%)
   451567634730279      stalled-cycles-backend    #   19.22% backend cycles idle      (16.67%)
  1990266573431552      instructions              #    0.85  insn per cycle         
                                                  #    0.23  stalled cycles per insn  (16.67%)
   174702404834644      branches                  #  263.828 M/sec                    (16.67%)
     2990295156773      branch-misses             #    1.71% of all branches          (16.67%)

   37169.382357201 seconds time elapsed

  507155.490375000 seconds user
  155023.145430000 seconds sys
--
 Performance counter stats for './perplexity -t 16 --perplexity -m /data/llama/llama-13B-ggml/ggml-model-q4_0.bin -f /data/llama/wikitext-2-raw/wiki.test.raw -b 1024 -c 1024':

      662585222.07 msec task-clock                #   17.835 CPUs utilized          
          13085731      context-switches          #   19.750 /sec                   
             16125      cpu-migrations            #    0.024 /sec                   
           2992539      page-faults               #    4.516 /sec                   
  2348655379927748      cycles                    #    3.545 GHz                      (16.67%)
    22037498769775      stalled-cycles-frontend   #    0.94% frontend cycles idle     (16.67%)
   451544376607233      stalled-cycles-backend    #   19.23% backend cycles idle      (16.67%)
  1990696383781254      instructions              #    0.85  insn per cycle         
                                                  #    0.23  stalled cycles per insn  (16.67%)
   174756680824883      branches                  #  263.750 M/sec                    (16.67%)
     2994803847151      branch-misses             #    1.71% of all branches          (16.67%)

   37149.877753316 seconds time elapsed

  507320.608394000 seconds user
  155260.266772000 seconds sys

ivanstepanovftw Apr 9, 2023
Collaborator

The greater context length, the more precise model can predict next token, that's why you see improvement

gjmulder Apr 9, 2023
Collaborator

The greater context length, the more precise model can predict next token, that's why you see improvement

Agreed. Except if you look at the plot it is actually the 1024 context length and not the 1280 length that has less improvement per chunk. There will likely be some limiting factors such as model design and some edge cases.

KerfuffleV2 · 2023-06-10T08:56:46Z

KerfuffleV2
Jun 10, 2023
Collaborator

@ikawrakow Any chance you could post/link to the perplexity data you used to generate the graphs/tables for k-quants added in PR #1684? Right now there's really no information for models >13B and having that available would be really helpful even if it's only for the new quantizations. (I can deal with it in any format, doesn't need to be cleaned up or anything.)

I'd do it myself but unfortunately my hardware falls far short of completing something like that in a reasonable amount of time.

0 replies

ikawrakow · 2023-06-10T09:40:33Z

ikawrakow
Jun 10, 2023

OK, here is a table

Model	Quantization	Perplexity
33B	Q2_K	4.7950
33B	Q3_K_S	4.5048
33B	Q3_K_M	4.3594
33B	Q3_K_L	4.3094
33B	Q4_K_S	4.2486
33B	Q4_K_M	4.2081
33B	Q5_K_S	4.1778
33B	Q5_K_M	4.1675
33B	Q6_K	4.1598
33B	fp16	4.1557
65B	Q2_K	4.1017
65B	Q3_K_S	3.8682
65B	Q3_K_M	3.6991
65B	Q4_K_M	3.5836
65B	Q5_K_M	3.5511
65B	Q6_K	3.5433
65B	fp16	3.5393

I haven't done Q4_K_S and Q5_K_S for the 65B model, but you can see from the curves posted in #1684 that the decrease in perplexity is quite smooth, so one can estimate them to be somewhere in the middle between Q3_K_L and Q4_K_M / Q4_K_M and Q5_K_M.

1 reply

LostRuins Jun 14, 2023
Collaborator

Just wondering if you have any comparisons with the older quants for 30b and 65b, since the main readme doesnt have those

KerfuffleV2 · 2023-06-10T09:44:52Z

KerfuffleV2
Jun 10, 2023
Collaborator

@ikawrakow Thank you! Do you have the ones for 7B and 13B also? Sorry I wasn't clear, when I mentioned that only 7B and 13B were currently available I was talking about the section in the main README which only includes the non-k-quants quantizations.

(I'd like to use this to generate other data/make comparisons so it's better if the source isn't an estimate.)

0 replies

ikawrakow · 2023-06-10T10:02:40Z

ikawrakow
Jun 10, 2023

@KerfuffleV2 You need something different compared to what is already provided in the description of #1684?

1 reply

KerfuffleV2 Jun 10, 2023
Collaborator

You need something different compared to what is already provided

@ikawrakow Please forgive me for wasting your time, I'm just an idiot. Thanks again!

KerfuffleV2 · 2023-06-14T15:30:27Z

KerfuffleV2
Jun 14, 2023
Collaborator

Oh yeah, I forgot to post this but here is what I generated based on the information available. Note: I had to calculate the sizes for 33B and 65B based on the other models and I suspect it may not actually be correct so take the stats involving size with a huge chunk of salt.

Here is the horrendous script that generated the below: https://gist.github.com/KerfuffleV2/d072237b4a9386e80cdc302f923843db (it started as a comprehension in the Python REPL so it never got a chance to be a real program)

The figures came from the main README and ikawrakow's response here + k-quants PR. I didn't generate any of them myself, just manipulated them.

Legend

+ppl - increase in perplexity compared to the f16 (or f32, I'm not sure but I'm just going to say f16 from now on) model.
+ppl % - percentage perplexity increased compared to the f16 model.
+ppl 13b to 7b % - If you go from a 13b f16 model to a 7b f16 model, perplexity increases by some amount. This percentage is on that scale. The idea is that people generally would say they can tell a difference between a 13b and a 7b model.
size 16bit % - Compared to the f16 model (or you could say 100 - this column is the percentage reduction).
+ppl per -1G - The amount perplexity increased per 1G reduced from the f16 model.

edit: Manually generated, but for reference:

from	to	-B	-B %	+ppl	+ppl per -1B
65B	33B	-32B	49.24%	0.6164	0.0192
33B	13B	-20B	60.61%	1.0986	0.0549
13B	7B	-6B	46.16%	0.6523	0.1087

Based on full quality models.

7B

name	+ppl	+ppl %	+ppl 13b to 7b %	size	size 16bit %	+ppl per -1G
q2_k	0.8698	14.726%	133.344%	2.67G	20.54%	0.084201
q3_ks	0.5505	9.320%	84.394%	2.75G	21.15%	0.053707
q3_km	0.2437	4.126%	37.360%	3.06G	23.54%	0.024517
q3_kl	0.1803	3.053%	27.641%	3.35G	25.77%	0.018684
q4_0	0.2499	4.231%	38.311%	3.50G	26.92%	0.026305
q4_1	0.1846	3.125%	28.300%	3.90G	30.00%	0.020286
q4_ks	0.1149	1.945%	17.615%	3.56G	27.38%	0.012172
q4_km	0.0535	0.906%	8.202%	3.80G	29.23%	0.005815
q5_0	0.0796	1.348%	12.203%	4.30G	33.08%	0.009149
q5_1	0.0415	0.703%	6.362%	4.70G	36.15%	0.005000
q5_ks	0.0353	0.598%	5.412%	4.33G	33.31%	0.004072
q5_km	0.0142	0.240%	2.177%	4.45G	34.23%	0.001661
q6_k	0.0044	0.074%	0.675%	5.15G	39.62%	0.000561
q8_0	0.0004	0.007%	0.061%	6.70G	51.54%	0.000063
f16	0.0000	0.000%	0.000%	13.00G	100.00%	0.000000

13B

name	+ppl	+ppl %	+ppl 13b to 7b %	size	size 16bit %	+ppl per -1G
q2_k	0.6002	11.423%	92.013%	5.13G	20.52%	0.030206
q3_ks	0.3490	6.642%	53.503%	5.27G	21.08%	0.017689
q3_km	0.1955	3.721%	29.971%	5.88G	23.52%	0.010225
q3_kl	0.1520	2.893%	23.302%	6.45G	25.80%	0.008194
q4_0	0.1317	2.507%	20.190%	6.80G	27.20%	0.007236
q4_1	0.1065	2.027%	16.327%	7.60G	30.40%	0.006121
q4_ks	0.0861	1.639%	13.199%	6.80G	27.20%	0.004731
q4_km	0.0459	0.874%	7.037%	7.32G	29.28%	0.002596
q5_0	0.0313	0.596%	4.798%	8.30G	33.20%	0.001874
q5_1	0.0163	0.310%	2.499%	9.10G	36.40%	0.001025
q5_ks	0.0242	0.461%	3.710%	8.36G	33.44%	0.001454
q5_km	0.0095	0.181%	1.456%	8.60G	34.40%	0.000579
q6_k	0.0025	0.048%	0.383%	9.95G	39.80%	0.000166
q8_0	0.0005	0.010%	0.077%	13.00G	52.00%	0.000042
f16	0.0000	0.000%	0.000%	25.00G	100.00%	0.000000

33B

name	+ppl	+ppl %	+ppl 13b to 7b %	size	size 16bit %	+ppl per -1G
q2_k	0.6393	15.384%	98.007%	12.93G	20.52%	0.012768
q3_ks	0.3491	8.401%	53.518%	13.29G	21.10%	0.007023
q3_km	0.2037	4.902%	31.228%	14.82G	23.52%	0.004228
q3_kl	0.1537	3.699%	23.563%	16.25G	25.79%	0.003288
q4_ks	0.0929	2.235%	14.242%	17.16G	27.24%	0.002027
q4_km	0.0524	1.261%	8.033%	18.44G	29.27%	0.001176
q5_ks	0.0221	0.532%	3.388%	21.05G	33.41%	0.000527
q5_km	0.0118	0.284%	1.809%	21.65G	34.37%	0.000285
q6_k	0.0041	0.099%	0.629%	25.05G	39.76%	0.000108
f16	0.0000	0.000%	0.000%	63.00G	100.00%	0.000000

65B

name	+ppl	+ppl %	+ppl 13b to 7b %	size	size 16bit %	+ppl per -1G
q2_k	0.5624	15.890%	86.218%	25.65G	20.52%	0.005661
q3_ks	0.3289	9.293%	50.422%	26.35G	21.08%	0.003334
q3_km	0.1598	4.515%	24.498%	29.40G	23.52%	0.001672
q4_km	0.0443	1.252%	6.791%	36.60G	29.28%	0.000501
q5_km	0.0118	0.333%	1.809%	43.00G	34.40%	0.000144
q6_k	0.0040	0.113%	0.613%	49.75G	39.80%	0.000053
f16	0.0000	0.000%	0.000%	125.00G	100.00%	0.000000

0 replies

ianscrivener · 2023-06-29T22:29:16Z

ianscrivener
Jun 29, 2023

All,
Soon I'll be coding up automated perplexity tests (see Azure CI brainstorming discussion)

@ggerganov suggests that the standard test model be OpenLlama

@SlyEcho mentions a "truncated wiki.test.raw perplexity test"

Q1: If I was only doing 1 test... should it test on OpenLlama 3B, 7B or 13B?
(ideally 3B as it is less download bandwidth per perplexity CI/CDtest - unless there are good reasons to use other)

Q2: HOW do I do a "truncated perplexity test"?

Q3: is i it likely that the above perplexity test will be a relevant comparison say in 3, 6, 9 months time?

6 replies

ianscrivener Jun 29, 2023

thanks @Green-Sky
OK.. so no easy way to automate that in CI/CD 😞

How hard would it to be to add a -truncate argument for ./perplexity ?

SlyEcho Jun 29, 2023
Collaborator Sponsor

It is nothing special, you just take the first 406 lines like with a command:

head -n406 wiki.test.raw > wiki.test.raw.406

ianscrivener Jun 29, 2023

aha! Now that you explain it... it makes so much sense 😳
I misunderstood that were after the 406th perplexity score

BTW: why 406... and not 405 or 408?
any reason?

SlyEcho Jun 30, 2023
Collaborator Sponsor

That's just where one of the text blocks ends, and also:

One additional motivation for choosing exactly 406 lines is that it chunks nicely into 72 chunks for a 512 context size, 36 for a 1024 context size, and 18 for a 2048 context size.

gjmulder Jun 30, 2023
Collaborator

aha! Now that you explain it... it makes so much sense
I misunderstood that were after the 406th perplexity score

@ianscrivener a bit more background info here: Quality Scores for a Herd of Llamas

My (limited) understanding is that the choice of wiki.test.raw is arbitrary. It could be any text the model should be familiar with. Therefore the choice of how many chunks of the file to use for perplexity is also arbitrary. I posted some comparative plots above of chunks versus perplexity that indicated the relative performance of different llama configs didn't change for shorter perplexity tests, so using a shorter file enabled me computationally to do a larger grid search of different model configs, including number of weights.

Note also that since different models are trained on different data sets, perplexity isn't an ideal metric for comparing between different models. Furthermore there are issues with the Open LLama tokenizer. The tokenizers are also not identical.

ianscrivener · 2023-06-30T01:05:20Z

ianscrivener
Jun 30, 2023

Open Llama 3B perplexity abbreviated test

Mac M2 2022 - 8 CPU cores (4 performance, 4 efficiency) - 16Gb RAM
built for Mac METAL (LLAMA_METAL=1)
406 lines of wiki.test.raw

make clean && LLAMA_METAL=1 make -j && time ./perplexity -m ./models/3B/open-llama-3b-q4_0.bin -f build/wiki.test.raw.406 -t 8

perplexity: 9.20 seconds per pass - ETA 10 minutes

[68]8.2964

llama_print_timings:        load time =  9362.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 365507.71 ms / 34816 tokens (   10.50 ms per token,    95.25 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 368908.22 ms

360.09s user 
4.75s system 
98% cpu 
6:08.93 total

0 replies

gjmulder · 2023-06-30T03:04:43Z

gjmulder
Jun 30, 2023
Collaborator

Q3: is i it likely that the above perplexity test will be a relevant comparison say in 3, 6, 9 months time?

@ianscrivener I checked out your Azure CI brainstorming discussion and noted that you're looking at perplexity performance and not quality? The way perplexity uses llama.cpp looks to be distinctly different from interactive use, as can be seen from comparing your perplexity run above to the output of this run (no GPU):

$ ./main -m /data/llama/7B/ggml-model-f16.bin -p "I believe the meaning of life is" -c 2048 -n 512 --ignore-eos -s 8 -n 64 -t 8
[..]
llama_print_timings:        load time =  1210.91 ms
llama_print_timings:      sample time =    42.64 ms /    64 runs   (    0.67 ms per token,  1501.01 tokens per second)
llama_print_timings: prompt eval time =  1440.90 ms /     8 tokens (  180.11 ms per token,     5.55 tokens per second)
llama_print_timings:        eval time = 16689.58 ms /    63 runs   (  264.91 ms per token,     3.77 tokens per second)
llama_print_timings:       total time = 18189.84 ms

3 replies

ianscrivener Jun 30, 2023

@gjmulder,
I'm still trying to get my head around all this and understand the task requirements ... reverse engineer the brief.

my understanding (perhaps wrong) is that we are aiming to get
(1)perplexity score - eg 8.2964 (per my test above)
(2) perplexity test runtime - eg 360.09s (per my test above)

for a given a combination of
(a) model - eg open-llama-3b-q4_0.bin
(b) hardware +GPU Library - eg Mac M2 2022 - 8 CPU cores (4 performance, 4 efficiency) - 16Gb RAM + Metal
(c) test data - eg wiki.test.raw.406
(d) context size - eg 512, 1024, 2048

each time there is a llama.cpp release we'd run a CI/CD process to get the test results (1 & 2)
multiple tests would be run - totalling abc*d

So - I was under the impression that the perplexity score is the quality you are referring to?!

ianscrivener Jun 30, 2023

@ggerganov am I on the right track for what your CI/CD goals are?

gjmulder Jun 30, 2023
Collaborator

So - I was under the impression that the perplexity score is the quality you are referring to?!

EDIT: Yes. It is a measure of the quality of generative ability of the model, and is a direct function of the dataset the model is trained on, the tokenizer used, whether or not the model over-fitted or under-fitted the training dataset, and the choice of text used in the perplexity evaluation. Therefore, if any one of these training design and evaluation choices is different, perplexity should be not used to compare different models. As long as we stick to only FB Llamas or only Open Llamas and the same perplexity evaluation text, perplexity is a valid metric.

I replied in #1985 specifically w.r.t. performance and not perplexity (quality).

ianscrivener · 2023-07-21T00:30:00Z

ianscrivener
Jul 21, 2023

In response to @ggerganov's call for perplexity and latency testing for llama.cpp I've coded llama.cpp perplexity scorecard... a helper project to run and gather ./perplexity results, uploads JSON data to AWS S3 for review, analysis and charting.

0 replies

klosax · 2023-07-22T18:05:56Z

klosax
Jul 22, 2023

Why not consider moving to a better and more used scoring method like HellaSwag?

Added support for measurement of a HellaSwag-like score in PR #2312 and started a discussion in #2321.

3 replies

ianscrivener Jul 22, 2023

thanks @klosax,
yes I saw your HellaSwag-like post. Yes, I'll add your HellaSwag-like test to the my MLOps scorecard project!

Currently, perplexity (and now HellaSwag-like) are being run manually... with results collated/shared manually and charts/tables made manually. There is no central record (database) of results... just posts in Github discussions.

So I'm working towards
(1) simplified batch benchmarking - that can be run by CI/CD
(2) automated posting of results to a single central datastore
(3) automated graphs & tables
(4) public dataset for further analysis

Rather than just one person, "someone with great computing power", running the benchmarks... the idea is that the broader community can run benchmarking.

klosax Jul 23, 2023

Great!
I'm looking forward to see the project up and running.

Remember that my test is not the same HellaSwag test that the Open LLM Leaderboard are using, so maybe we could call it the HellaSwagish test instead. The scores are linearly correlated but are not the same numbers as on the Leaderboard. Also, I suggest only using the 600 line dataset since the scores have stabilized at that point.

ianscrivener Jul 23, 2023

"HellaSwagish" it is!.
600 lines.. agree - I've just run 200 & 800 locally and seen your charts

YellowRoseCx · 2023-08-30T04:37:34Z

YellowRoseCx
Aug 30, 2023

Is there any place with updated perplexity scores for LLAMA2 and current codebase with GGUF?

10 replies

KerfuffleV2 Sep 22, 2023
Collaborator

@BarfingLemurs
#2807 improved perplexity a bit for Q2_K, Q3_K and Q4_K on LLaMA2 70B. I don't think your table accounts for that pull yet? (Thanks for gathering the information!)

BarfingLemurs Sep 22, 2023

updated, ty

YellowRoseCx Sep 22, 2023

Also, on hellaswag: #2321 (comment), for Llama-2-7B:

Quantization Difference to fp16

Q6_K +0.060 +/- 0.111

Q5_K_M -0.280 +/- 0.144

Q5_K_S -0.460 +/- 0.148

Q4_K_M -0.460 +/- 0.175

Q4_K_S -0.780 +/- 0.190

Q3_K_M -1.420 +/- 0.255

Q2_K -2.060 +/- 0.312

Q5_0 -0.200 +/- 0.162

Q5_1 -0.220 +/- 0.148

Q4_0 -0.960 +/- 0.209

Q4_1 -1.000 +/- 0.224

(k quants perform as expected!)

Also, llama-1-7b speed comparison from #3093

Am I reading it right that all of the Q3 models are slower than q4 models?

KerfuffleV2 Sep 22, 2023
Collaborator

Am I reading it right that all of the Q3 models are slower than q4 models?

Not super surprising. Q3 is an odd size, and may also be more complicated internally. Odd sizes like that are less likely to be nicely aligned, fit in chunks for parallel operations (which usually use factors of 2), etc.

YellowRoseCx Sep 22, 2023

for me, q3_k_m 70b models on a 6800xt, generation speed starts around 600ms a token and moves up to around 800ms at higher context sizes, but for the same model in Q4_k_m or q4_k_s, it's between 900 and 1150ms a token. Maybe it has to do with me only partially offloading

ianscrivener · 2023-09-20T00:28:29Z

ianscrivener
Sep 20, 2023

Agree. I'd also like to see perplexity and Hellaswag scores, updated with each code release at least for a handful of llama2 gguf models. I'm really interested to see and understand the improvements over time - both in the code and the models. Ie quality and performance benchmarks over time.

I've done preliminary code for this (both in python and node.js) after @ggerganov put out the call and said that sufficient Azure cloud resources would be soon available. I put an MLOps benchmarking roadmap to @ggerganov in an email but did not get a reply. There does not seem to be very much interest in this from the C++ developers.

I'm off grid, accessing via 4G, and only have a Macbook Pro. So without access to the cloud GPU that Azure has given to the project I cannot do proceed.

Personally, I'd like to see llama.cpp grow beyond just the (excellent) core C++ library, ie adding;

project web site
documentation
(as discussed) quality and performance benchmarks
futher and more targeted development of specific 'products' - building upon ./main and ./server

0 replies

goerch · 2023-09-21T19:28:58Z

goerch
Sep 21, 2023
Collaborator

Here are my cents:

project web site

For others to decide.

documentation

We announce a couple of supported models without documentation (or a link?) on how to convert and run. This looks (and probably is) bad.

(as discussed) quality and performance benchmarks

Personally I'm quite interested.

futher and more targeted development of specific 'products' - building upon ./main and ./server

I'd assume downstream is working on it.

2 replies

ianscrivener Sep 21, 2023

downstream?

goerch Sep 21, 2023
Collaborator

Consumers of llama.cpp (might be upstream, not a native speaker, sorry)?

BarfingLemurs · 2023-10-03T16:42:55Z

BarfingLemurs
Oct 3, 2023

Mistral 7b compared to other llamas:

Q4_K_M:
Llama 1 - 5.96
Llama 2 - 5.88
Mistral v0.1 - 5.75

(If perplexity isn't a fair benchmarking tool, we can use hellaswag score for kquants)

3 replies

Green-Sky Oct 3, 2023
Collaborator Author

perplexity is only ever fair, when the exact same data was used to train the models.

KerfuffleV2 Oct 3, 2023
Collaborator

Stuff like whether the model is instruct tuned can also make a difference. If the model expects a specific prompt format, the chunks from perplexity won't conform to that.

Basically, as far as I understand it, perplexity doesn't really work well comparing different models with each other. It's more meaningful comparing stuff like the effects of different quantizations on the same model (and even then, pretty much only looking at the relative changes rather than the absolute value).

FNsi Oct 4, 2023

I'd like truthful QA.

taratt · 2024-05-08T22:00:50Z

taratt
May 8, 2024

Hi.
I am not sure if this is the best place to ask this but I have a question regrading llama 2 7b perplexity. I am using the huggingface model meta-llama/Llama-2-7b-hf and with a context size of 4096, no matter what I do I get a perplexity of 7.84 on Wikitext2. I am using the wanda codebase (https://github.com/locuslab/wanda/tree/main) for this. Is this number familiar to anyone? ANY help would be appreciated.

1 reply

BarfingLemurs May 9, 2024

@taratt

https://huggingface.co/datasets/ggml-org/ci/blob/main/wikitext-2-raw-v1.zip For the above, we only use wiki.test.raw, evaluating at 512ctx.

Perplexity (Quality of Generation) Scores #406

Green-Sky Mar 22, 2023 Collaborator

Mostly Default ./perplexity settings with all of wiki.test.raw

Context sizes: (512 | 1024 | 2048) ⨯ (7B | 13B | 30B | 65B) ⨯ (llama | alpaca[-lora] | vicuna-GPTQ) models, first 406 lines of wiki.test.raw:

History

Replies: 57 comments · 161 replies

Green-Sky Mar 22, 2023 Collaborator Author

glinscott Mar 22, 2023 Collaborator

glinscott Mar 22, 2023 Collaborator

Green-Sky Mar 22, 2023 Collaborator Author

glinscott Mar 22, 2023 Collaborator

glinscott Mar 23, 2023 Collaborator

glinscott Mar 23, 2023 Collaborator

gjmulder Mar 23, 2023 Collaborator

gjmulder Mar 23, 2023 Collaborator

Green-Sky Mar 24, 2023 Collaborator Author

gjmulder Mar 24, 2023 Collaborator

Green-Sky Mar 24, 2023 Collaborator Author

Green-Sky Mar 24, 2023 Collaborator Author

gjmulder Mar 24, 2023 Collaborator

Green-Sky Mar 24, 2023 Collaborator Author

gjmulder Mar 26, 2023 Collaborator

Green-Sky Mar 24, 2023 Collaborator Author

Green-Sky Mar 24, 2023 Collaborator Author

ggerganov Mar 24, 2023 Maintainer

glinscott Mar 24, 2023 Collaborator

Green-Sky Mar 24, 2023 Collaborator Author

Green-Sky Mar 24, 2023 Collaborator Author

gjmulder Mar 24, 2023 Collaborator

j-f1 Mar 24, 2023 Collaborator

ggerganov Mar 25, 2023 Maintainer

gjmulder Mar 26, 2023 Collaborator

ggerganov Mar 26, 2023 Maintainer

gjmulder Mar 26, 2023 Collaborator

Test plan for my 16 core AMD Threadripper 1950X for the new few weeks while I'm away:

gjmulder Mar 26, 2023 Collaborator

gjmulder Mar 28, 2023 Collaborator

gjmulder Mar 30, 2023 Collaborator

ivanstepanovftw Apr 9, 2023 Collaborator

gjmulder Apr 9, 2023 Collaborator

Green-Sky
Mar 22, 2023
Collaborator

Mostly Default `./perplexity` settings with all of `wiki.test.raw`

Context sizes: `(512 | 1024 | 2048) ⨯ (7B | 13B | 30B | 65B) ⨯ (llama | alpaca[-lora] | vicuna-GPTQ)` models, first 406 lines of `wiki.test.raw`:

Replies: 57 comments 161 replies

Green-Sky
Mar 22, 2023
Collaborator Author

glinscott Mar 22, 2023
Collaborator

glinscott Mar 22, 2023
Collaborator

Green-Sky Mar 22, 2023
Collaborator Author

glinscott Mar 22, 2023
Collaborator

glinscott Mar 23, 2023
Collaborator

glinscott Mar 23, 2023
Collaborator

gjmulder
Mar 23, 2023
Collaborator

gjmulder Mar 23, 2023
Collaborator

Green-Sky Mar 24, 2023
Collaborator Author

gjmulder
Mar 24, 2023
Collaborator

Green-Sky Mar 24, 2023
Collaborator Author

Green-Sky Mar 24, 2023
Collaborator Author

gjmulder Mar 24, 2023
Collaborator

Green-Sky Mar 24, 2023
Collaborator Author

gjmulder Mar 26, 2023
Collaborator

Green-Sky Mar 24, 2023
Collaborator Author

Green-Sky Mar 24, 2023
Collaborator Author

ggerganov Mar 24, 2023
Maintainer

glinscott Mar 24, 2023
Collaborator

Green-Sky Mar 24, 2023
Collaborator Author

Green-Sky Mar 24, 2023
Collaborator Author

gjmulder Mar 24, 2023
Collaborator

j-f1 Mar 24, 2023
Collaborator

ggerganov
Mar 25, 2023
Maintainer

gjmulder Mar 26, 2023
Collaborator

ggerganov Mar 26, 2023
Maintainer

gjmulder
Mar 26, 2023
Collaborator

gjmulder Mar 26, 2023
Collaborator

gjmulder Mar 28, 2023
Collaborator

gjmulder Mar 30, 2023
Collaborator

ivanstepanovftw Apr 9, 2023
Collaborator

gjmulder Apr 9, 2023
Collaborator