
add llama 3 support to llm.c #754

Draft · karpathy wants to merge 48 commits into master
Conversation

karpathy (Owner)

This branch starts with a copy-paste of train_gpt2.cu and test_gpt2.cu, but these two files (and other files) will change to incorporate Llama 3.1 support before merging back to master.

karpathy marked this pull request as draft September 13, 2024 19:41
karpathy and others added 28 commits September 13, 2024 20:44
… hyperparameters, introduce int+float section of header, read the header and EXIT for now
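For context, a minimal sketch of reading a header laid out as an int section followed by a float section, as this commit describes; the 256-entry section sizes and the specific fields printed are assumptions for illustration, not the checkpoint's actual layout.

```
#include <stdio.h>
#include <stdlib.h>

// hedged sketch: read a checkpoint header split into int and float sections
// (256 entries each is an assumption; so are the magic/version/theta fields)
void read_checkpoint_header(FILE *model_file) {
    int header_int[256];
    float header_float[256];
    if (fread(header_int, sizeof(int), 256, model_file) != 256) { exit(EXIT_FAILURE); }
    if (fread(header_float, sizeof(float), 256, model_file) != 256) { exit(EXIT_FAILURE); }
    printf("magic: %d version: %d\n", header_int[0], header_int[1]);
    printf("example float hyperparameter (e.g. rope theta): %f\n", header_float[0]);
}
```

The float section presumably makes room for Llama 3 hyperparameters that are not integers (RoPE theta, norm eps), which an all-int header has no place for.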
…combination of 3 hacks. this will make it so that we have to change very little code on the C side
… so now we are loading all the Llama 3 weights. I verified that the sizes of all the tensors agree with python, and the total number of parameters
…a 3. The activations match after encoding. onwards
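Since the first checkpoint being compared is the activations right after encoding, here is a hedged CPU reference of what that step amounts to in a Llama-style model: a pure token-embedding lookup, with no learned positional embedding (positions enter later through RoPE). Buffer layout is an assumption.

```
// hedged sketch: encoder forward = embedding row copy per (b, t) token
void encoder_forward_cpu(float *out, const int *inp, const float *wte,
                         int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *src = wte + inp[b * T + t] * C;  // token's embedding row
            float *dst = out + (b * T + t) * C;
            for (int c = 0; c < C; c++) { dst[c] = src[c]; }
        }
    }
}
```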
…d then we are focusing on the non-cudnn path at first. we're currently right after the first rmsnorm. the encoding right before this matched EXACTLY. but right now, after the first rmsnorm there is already an error of 1e-3 or so, which is highly suspicious so we are looking into it.
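For reference while chasing that 1e-3 discrepancy, a minimal CPU rmsnorm sketch; the eps value and buffer layout are assumptions, and an eps or precision mismatch is exactly the kind of thing that produces errors of this size.

```
#include <math.h>

// hedged sketch: rmsnorm forward, y = x / rms(x) * w per (b, t) position
void rmsnorm_forward_cpu(float *out, const float *inp, const float *weight,
                         int B, int T, int C) {
    float eps = 1e-5f;  // assumption; some Llama configs use 1e-6
    for (int bt = 0; bt < B * T; bt++) {
        const float *x = inp + bt * C;
        float *y = out + bt * C;
        float ss = 0.0f;
        for (int c = 0; c < C; c++) { ss += x[c] * x[c]; }
        float rrms = 1.0f / sqrtf(ss / C + eps);
        for (int c = 0; c < C; c++) { y[c] = x[c] * rrms * weight[c]; }
    }
}
```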
- [ ] WIP: CPU kernel
- [ ] CUDA kernel
- [ ] WIP CUDA version
insop and others added 19 commits September 25, 2024 17:40
- kernel 1 is tested

- build
```
make repkv_backward
/usr/local/cuda/bin/nvcc -O3 --use_fast_math --generate-code arch=compute_80,code=[compute_80,sm_80] -lcublas -lcublasLt -std=c++17 repkv_backward.cu -o repkv_backward
```

- test run on A30
```
Using kernel 1
Checking block size 32.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
Checking block size 64.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
Checking block size 128.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
Checking block size 256.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
Checking block size 512.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
Checking block size 1024.
0.531524 0.531524
0.600285 0.600285
0.458787 0.458787
0.296680 0.296680
-0.911627 -0.911627
All results match. Starting benchmarks.

block_size   32 time 3.2461 ms
block_size   64 time 1.7509 ms
block_size  128 time 1.7374 ms
block_size  256 time 1.7441 ms
block_size  512 time 1.8092 ms
block_size 1024 time 2.0443 ms
```
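For readers outside the PR: repkv replicates the K and V heads of a grouped-query-attention QKV projection so that downstream attention code can keep assuming n_head == n_kv_head. A hedged CPU sketch of the forward pass; the packed (q, k, v) layout and the divisibility of NH by NKVH are assumptions.

```
// hedged sketch: expand (B, T, (NH + 2*NKVH)*HS) packed qkv
// into (B, T, 3*NH*HS) by copying each kv head NH/NKVH times
void repkv_forward_cpu(float *out, const float *inp,
                       int B, int T, int NH, int NKVH, int HS) {
    int rep = NH / NKVH;  // replication factor
    int Cin = (NH + 2 * NKVH) * HS;
    int Cout = 3 * NH * HS;
    for (int bt = 0; bt < B * T; bt++) {
        const float *x = inp + bt * Cin;
        float *y = out + bt * Cout;
        for (int i = 0; i < NH * HS; i++) { y[i] = x[i]; }  // queries pass through
        for (int h = 0; h < NH; h++) {
            const float *k = x + NH * HS + (h / rep) * HS;
            const float *v = x + (NH + NKVH) * HS + (h / rep) * HS;
            for (int i = 0; i < HS; i++) {
                y[NH * HS + h * HS + i] = k[i];
                y[2 * NH * HS + h * HS + i] = v[i];
            }
        }
    }
}
```

The backward pass tested above is the transpose of this copy: gradients of the replicated heads sum back into their shared source head.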
…is probably ready to be integrated into llmc. we are still using 2X too much shared memory because I didn't want to change way too many things at the same time. I copy pasted our kernel10 of layernorm backward and made tweaks to it, removing the bias and mean. cool
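A hedged CPU sketch of what "layernorm backward minus the bias and mean terms" works out to per (b, t) position, under the same layout and eps assumptions as the forward sketch above. Writing ms = mean(x^2) + eps, the identity is dx = (w*dout - x * mean(w*dout*x) / ms) / sqrt(ms), and dw accumulates dout * x / sqrt(ms).

```
#include <math.h>

// hedged sketch: rmsnorm backward (layernorm backward with bias/mean removed)
void rmsnorm_backward_cpu(float *dinp, float *dweight, const float *dout,
                          const float *inp, const float *weight,
                          int B, int T, int C) {
    float eps = 1e-5f;  // assumption, must match the forward pass
    for (int bt = 0; bt < B * T; bt++) {
        const float *x = inp + bt * C;
        const float *dy = dout + bt * C;
        float *dx = dinp + bt * C;
        float ss = 0.0f;
        for (int c = 0; c < C; c++) { ss += x[c] * x[c]; }
        float ms = ss / C + eps;   // mean square plus eps
        float rrms = 1.0f / sqrtf(ms);
        float dot = 0.0f;          // mean over channels of w * dout * x
        for (int c = 0; c < C; c++) { dot += weight[c] * dy[c] * x[c]; }
        dot /= C;
        for (int c = 0; c < C; c++) {
            dx[c] += (weight[c] * dy[c] - x[c] * dot / ms) * rrms;
            dweight[c] += dy[c] * x[c] * rrms;
        }
    }
}
```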
Adding backward kernel for repkv on `llama3` branch (cudamode-irl)
…ad and more efficient implementation kernel2 is desirable and desired
…lly at the point where the prints happen, gradients match. but once we backward attention, rope and repkv, gradients don't match. attention hasn't changed so that can't be wrong (?), so it's either repkv or rope. i have to go slower and double check the backward pass of both of these in detail. also had to introduce one additional buffer for backward
…i mean, it's trivial. this can't possibly be the issue. it must be the repkv
…orrect. repkv backward looks correct. rope backward is trivial so i don't see how it's not correct, and i also checked it. basically i'm really confused right now
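For the record, a sketch of why the RoPE backward is called trivial above: the forward rotates each channel pair by a position-dependent angle, and since rotations are orthogonal the backward is the same rotation with the angle negated. The 500000 frequency base (Llama 3.1's default) and consecutive-pair layout are assumptions here.

```
#include <math.h>

// hedged sketch: RoPE on one head of size HS over T positions;
// sign = +1 gives the forward rotation, sign = -1 the backward
void rope_cpu(float *out, const float *inp, int T, int HS, int sign) {
    for (int t = 0; t < T; t++) {
        for (int i = 0; i < HS; i += 2) {
            float freq = powf(500000.0f, -(float)i / HS);  // per-pair frequency
            float c = cosf(t * freq);
            float s = sinf(t * freq) * (float)sign;
            float x0 = inp[t * HS + i], x1 = inp[t * HS + i + 1];
            out[t * HS + i]     = x0 * c - x1 * s;
            out[t * HS + i + 1] = x0 * s + x1 * c;
        }
    }
}
```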
…ing this bug and bringing light to darkness and order to chaos. A true warrior in the fight against entropy.
…ncoder backward (that's coming next). i think 3e-3 seems ok just inspecting the differences manually. probably this is correct. encoder backward next
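A hedged sketch of the kind of manual inspection described, reporting the largest absolute difference between two gradient buffers against the 3e-3 tolerance mentioned above.

```
#include <math.h>
#include <stdio.h>

// hedged sketch: compare two gradient buffers, return 1 if within tolerance
int check_grads(const float *a, const float *b, int n, float tol) {
    float maxdiff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - b[i]);
        if (d > maxdiff) { maxdiff = d; }
    }
    printf("max abs diff: %e (tol %e)\n", maxdiff, tol);
    return maxdiff <= tol;
}
```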
BF16 opt state (m/v) with stochastic rounding (Llama3 branch)
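For context, stochastic rounding keeps low-precision optimizer state unbiased: instead of always truncating toward the nearest representable BF16 value, it rounds up with probability proportional to the discarded fraction, so rounding errors average out across steps. A minimal sketch, assuming a caller-supplied uniform 16-bit random value; NaN/Inf edge cases are ignored.

```
#include <stdint.h>
#include <string.h>

// hedged sketch: float -> bf16 (stored as uint16_t) with stochastic rounding.
// adding noise to the 16 bits that truncation discards makes the carry into
// the kept bits fire with probability (discarded fraction) / 2^16
uint16_t stochastic_round_bf16(float x, uint16_t rand16) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits += rand16;                // may carry into the upper 16 bits
    return (uint16_t)(bits >> 16);
}
```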