add llama 3 support to llm.c #754

Draft · wants to merge 48 commits into master

Commits on Sep 13, 2024

  1. 09b47a7 (commit message not shown)
  2. first set of changes to match up the .py and the .cu version. default hyperparameters, introduce int+float section of header, read the header and EXIT for now
    karpathy committed Sep 13, 2024 (01bc4c6)
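    The int+float header split can be sketched roughly as below: a fixed block of int32 fields followed by a block of float32 fields at the top of the checkpoint file. The field indices, section sizes and file name here are assumptions for illustration, not necessarily the exact layout used in this commit.

    ```
    // Hypothetical sketch: read a checkpoint header with an integer section
    // followed by a float section, print a few fields, and exit.
    #include <stdio.h>

    #define HEADER_INTS 256    // assumed size of the int32 section
    #define HEADER_FLOATS 256  // assumed size of the float32 section

    int main(int argc, char** argv) {
        const char* path = argc > 1 ? argv[1] : "llama3.bin";  // hypothetical file
        FILE* f = fopen(path, "rb");
        if (f == NULL) { fprintf(stderr, "could not open %s\n", path); return 1; }
        int header_int[HEADER_INTS];
        float header_float[HEADER_FLOATS];
        if (fread(header_int, sizeof(int), HEADER_INTS, f) != HEADER_INTS ||
            fread(header_float, sizeof(float), HEADER_FLOATS, f) != HEADER_FLOATS) {
            fprintf(stderr, "failed to read header\n"); fclose(f); return 1;
        }
        // a few illustrative fields; the indices are made up for this sketch
        int magic        = header_int[0];
        int max_seq_len  = header_int[2];
        int vocab_size   = header_int[3];
        float rope_theta = header_float[0];
        float norm_eps   = header_float[1];
        printf("magic=%d max_seq_len=%d vocab_size=%d rope_theta=%f norm_eps=%f\n",
               magic, max_seq_len, vocab_size, rope_theta, norm_eps);
        fclose(f);
        return 0;  // like the commit: read the header, then EXIT for now
    }
    ```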
  3. change the export code of Llama 3 to be very GPT-2 friendly, using a combination of 3 hacks. this will make it so that we have to change very little code on the C side
    karpathy committed Sep 13, 2024 (b883560)

Commits on Sep 16, 2024

  1. adapt the sizes of all the parameter tensors and load them from file. so now we are loading all the Llama 3 weights. I verified that the sizes of all the tensors agree with python, and the total number of parameters
    karpathy committed Sep 16, 2024 (8866308)
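    For reference, a rough sketch of where the parameter count comes from for a Llama-3-style model with grouped-query attention and a SwiGLU MLP. The names are illustrative, not the exact identifiers in train_llama3.cu, and the sketch assumes an untied output head and no bias terms.

    ```
    #include <stddef.h>

    typedef struct {
        int V;     // vocab size
        int C;     // model dim (= NH * head_dim)
        int L;     // number of layers
        int NH;    // number of query heads
        int NKVH;  // number of key/value heads (NKVH < NH under GQA)
        int FF;    // hidden dim of the SwiGLU MLP
    } LlamaConfig;  // hypothetical struct for this sketch

    size_t llama_num_parameters(LlamaConfig c) {
        size_t hd = (size_t)c.C / c.NH;      // head dim
        size_t kv = (size_t)c.NKVH * hd;     // width of the K/V projections
        size_t n = 0;
        n += (size_t)c.V * c.C;              // token embedding; no positional table
        for (int l = 0; l < c.L; l++) {
            n += c.C;                        // rmsnorm weight before attention
            n += (size_t)c.C * c.C;          // wq
            n += (size_t)c.C * kv;           // wk (narrower than wq under GQA)
            n += (size_t)c.C * kv;           // wv
            n += (size_t)c.C * c.C;          // wo, attention output projection
            n += c.C;                        // rmsnorm weight before the MLP
            n += (size_t)c.C * c.FF;         // w1, gate projection
            n += (size_t)c.C * c.FF;         // w3, up projection
            n += (size_t)c.FF * c.C;         // w2, down projection
        }
        n += c.C;                            // final rmsnorm weight
        n += (size_t)c.V * c.C;              // output head (assumed untied)
        return n;
    }
    ```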
  2. make llama3cu phony
    karpathy committed Sep 16, 2024 (45026f6)
  3. 77e1d7a (commit message not shown)
  4. add new Encoder that does not use positional embeddings, like in llama 3. The activations match after encoding. onwards
    karpathy committed Sep 16, 2024 (72e6f1a)
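    A minimal sketch of an encoder forward that only gathers token embeddings; unlike GPT-2 there is no wpe[t] term because Llama 3 injects position information later via RoPE inside attention. Names and layout are assumptions, not the exact kernel added in this commit.

    ```
    #include <cuda_runtime.h>

    // out is (B, T, C), inp is (B, T) token ids, wte is (V, C)
    __global__ void encoder_forward_kernel(float* out, const int* inp,
                                           const float* wte, int B, int T, int C) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= B * T * C) return;
        int bt = idx / C;                 // which (batch, time) position
        int c  = idx % C;                 // which channel
        int token = inp[bt];              // token id at this position
        out[idx] = wte[(long long)token * C + c];   // pure embedding lookup, no wpe
    }

    // hypothetical launcher
    void encoder_forward(float* out, const int* inp, const float* wte,
                         int B, int T, int C, cudaStream_t stream) {
        int N = B * T * C;
        int block_size = 256;
        int grid_size = (N + block_size - 1) / block_size;
        encoder_forward_kernel<<<grid_size, block_size, 0, stream>>>(out, inp, wte, B, T, C);
    }
    ```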
  5. 234de31 (commit message not shown)

Commits on Sep 17, 2024

  1. move debugging into fp32, so python has to write the fp32 version, and then we are focusing on the non-cudnn path at first. we're currently right after the first rmsnorm. the encoding right before this matched EXACTLY. but right now, after the first rmsnorm there is already an error of 1e-3 or so, which is highly suspicious so we are looking into it.
    karpathy committed Sep 17, 2024 (508c474)
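    For reference while chasing the mismatch: RMSNorm has no mean subtraction and no bias, y_i = w_i * x_i / sqrt(mean(x^2) + eps). A simple, unoptimized kernel sketch with one block per (b, t) row (names are assumptions, not the kernel in the branch):

    ```
    // launch with grid = B*T, a power-of-two block size, and
    // block_size * sizeof(float) bytes of dynamic shared memory
    __global__ void rmsnorm_forward_kernel(float* out, const float* inp,
                                           const float* weight, int C, float eps) {
        extern __shared__ float shared[];
        const float* x = inp + (size_t)blockIdx.x * C;
        float* y = out + (size_t)blockIdx.x * C;
        // sum of squares over the row
        float sumsq = 0.0f;
        for (int i = threadIdx.x; i < C; i += blockDim.x) {
            sumsq += x[i] * x[i];
        }
        shared[threadIdx.x] = sumsq;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) shared[threadIdx.x] += shared[threadIdx.x + stride];
            __syncthreads();
        }
        float rstd = rsqrtf(shared[0] / C + eps);
        // scale; note: no mean subtraction and no bias, unlike layernorm
        for (int i = threadIdx.x; i < C; i += blockDim.x) {
            y[i] = weight[i] * (x[i] * rstd);
        }
    }
    ```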
  2. 685617f (commit message not shown)

Commits on Sep 21, 2024

  1. 56f956c (commit message not shown)

Commits on Sep 22, 2024

  1. DRAFT: Adding backward kernel for repkv

    - [ ] WIP: CPU kernel
    - [ ] Cuda kernel

    insop committed Sep 22, 2024 (45401b4)
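    For context, repkv replicates the grouped-query K/V heads so that downstream attention sees NH full heads; the backward pass therefore sums gradients over the replicas back into the original K/V heads. A CPU reference sketch under assumed layouts (input is (B, T, (NH + 2*NKVH)*HD), output is (B, T, 3*NH*HD)); names are illustrative, not the code in this commit.

    ```
    #include <stddef.h>

    void repkv_forward_cpu(float* qkvr, const float* qkv,
                           int B, int T, int NH, int NKVH, int HD) {
        int rep = NH / NKVH;                      // replication factor
        int in_row  = (NH + 2 * NKVH) * HD;
        int out_row = 3 * NH * HD;
        for (int bt = 0; bt < B * T; bt++) {
            const float* in = qkv + (size_t)bt * in_row;
            float* out = qkvr + (size_t)bt * out_row;
            // queries are copied through unchanged
            for (int i = 0; i < NH * HD; i++) out[i] = in[i];
            // each K/V head is written rep times
            for (int h = 0; h < NH; h++) {
                int kvh = h / rep;                // source K/V head for this head
                for (int d = 0; d < HD; d++) {
                    out[NH*HD   + h*HD + d] = in[NH*HD          + kvh*HD + d]; // K
                    out[2*NH*HD + h*HD + d] = in[(NH + NKVH)*HD + kvh*HD + d]; // V
                }
            }
        }
    }

    // backward: dQ passes through, dK/dV are summed over the replicas
    void repkv_backward_cpu(float* dqkv, const float* dqkvr,
                            int B, int T, int NH, int NKVH, int HD) {
        int rep = NH / NKVH;
        int in_row  = (NH + 2 * NKVH) * HD;
        int out_row = 3 * NH * HD;
        for (int bt = 0; bt < B * T; bt++) {
            float* din = dqkv + (size_t)bt * in_row;
            const float* dout = dqkvr + (size_t)bt * out_row;
            for (int i = 0; i < NH * HD; i++) din[i] = dout[i];
            for (int i = 0; i < 2 * NKVH * HD; i++) din[NH*HD + i] = 0.0f;
            for (int h = 0; h < NH; h++) {
                int kvh = h / rep;
                for (int d = 0; d < HD; d++) {
                    din[NH*HD          + kvh*HD + d] += dout[NH*HD   + h*HD + d]; // dK
                    din[(NH + NKVH)*HD + kvh*HD + d] += dout[2*NH*HD + h*HD + d]; // dV
                }
            }
        }
    }
    ```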
  2. CPU version tested

    - [ ] WIP cuda version

    insop committed Sep 22, 2024 (080e57f)
  3. 6c68657 (commit message not shown)
  4. WIP updating cuda kernel
    insop committed Sep 22, 2024 (ad46043)
  5. minor clean up
    insop committed Sep 22, 2024 (42d09e8)
  6. Add minor change
    insop committed Sep 22, 2024 (fcc3466)

Commits on Sep 24, 2024

  1. wip
    insop committed Sep 24, 2024 (de9c817)
  2. integrate the repkv kernel with minor changes. use the bt4c buffer for the replication. rope is next
    karpathy committed Sep 24, 2024 (76b40e4)
  3. 026e4ed (commit message not shown)

Commits on Sep 25, 2024

  1. 8336d2a (commit message not shown)
  2. Add rmsnorm fused kernel
    gordicaleksa committed Sep 25, 2024 (2ebf8f6)
  3. 52c7254 (commit message not shown)
  4. Merge pull request #769 from gordicaleksa/fused_rmsnorm

    Fused rmsnorm reference
    karpathy authored Sep 25, 2024 (6538df6)
  5. bb3c92d (commit message not shown)
  6. add swiglu, yay!
    karpathy committed Sep 25, 2024 (1826752)
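    SwiGLU replaces GPT-2's GELU MLP: the block projects to two halves and the nonlinearity is silu(x1) * x2 with silu(x) = x * sigmoid(x). A minimal element-wise kernel sketch (the side-by-side buffer layout is an assumption, not necessarily the layout used in the branch):

    ```
    // inp is (N, 2*FF): first FF channels are the gate (w1) path, the next FF
    // channels are the up (w3) path. out is (N, FF).
    __global__ void swiglu_forward_kernel(float* out, const float* inp, int N, int FF) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N * FF) return;
        int row = idx / FF;
        int col = idx % FF;
        float x1 = inp[(size_t)row * 2 * FF + col];        // gate activation
        float x2 = inp[(size_t)row * 2 * FF + FF + col];   // up (linear) activation
        float silu = x1 / (1.0f + expf(-x1));              // x * sigmoid(x)
        out[idx] = silu * x2;
    }
    ```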
  7. forward pass matches! losses are the same. now comes the backward pass
    karpathy committed Sep 25, 2024 (0731b39)
  8. 8874c2c (commit message not shown)
  9. 3e5134d (commit message not shown)

Commits on Sep 26, 2024

  1. Updated repkv_backward cuda kernel

    - kernel 1 is tested
    
    - build
    ```
    make repkv_backward
    /usr/local/cuda/bin/nvcc -O3 --use_fast_math --generate-code arch=compute_80,code=[compute_80,sm_80] -lcublas -lcublasLt -std=c++17 repkv_backward.cu -o repkv_backward
    ```
    
    - test run on A30
    ```
    Using kernel 1
    Checking block size 32.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    Checking block size 64.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    Checking block size 128.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    Checking block size 256.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    Checking block size 512.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    Checking block size 1024.
    0.531524 0.531524
    0.600285 0.600285
    0.458787 0.458787
    0.296680 0.296680
    -0.911627 -0.911627
    All results match. Starting benchmarks.
    
    block_size   32 time 3.2461 ms
    block_size   64 time 1.7509 ms
    block_size  128 time 1.7374 ms
    block_size  256 time 1.7441 ms
    block_size  512 time 1.8092 ms
    block_size 1024 time 2.0443 ms
    ```
    insop committed Sep 26, 2024 (d1f2f64)
  2. add rmsnorm backward in dev/cuda, it seems to work surprisingly, and is probably ready to be integrated into llmc. we are still using 2X too much shared memory because I didn't want to change way too many things at the same time. I copy pasted our kernel10 of layernorm backward and made tweaks to it removing the bias and mean. cool
    karpathy committed Sep 26, 2024 (31be5e7)
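    For the backward pass being added here, with y_i = w_i * x_i * rstd and rstd = 1/sqrt(mean(x^2) + eps), the row-wise input gradient is dx_i = rstd * (w_i*dy_i - x_i * rstd^2 * (1/C) * sum_j w_j*dy_j*x_j), and dw_i accumulates dy_i*x_i*rstd over all rows. A plain CPU reference of that math (a sketch, not the shared-memory kernel from the commit):

    ```
    #include <math.h>
    #include <stddef.h>

    // x: (N, C) inputs, w: (C) weights, dy: (N, C) upstream grads
    // dx: (N, C), dw: (C) accumulated across all rows
    void rmsnorm_backward_cpu(float* dx, float* dw, const float* dy,
                              const float* x, const float* w,
                              int N, int C, float eps) {
        for (int n = 0; n < N; n++) {
            const float* x_row  = x  + (size_t)n * C;
            const float* dy_row = dy + (size_t)n * C;
            float* dx_row = dx + (size_t)n * C;
            // recompute rstd for this row (could also be cached from the forward)
            float sumsq = 0.0f;
            for (int i = 0; i < C; i++) sumsq += x_row[i] * x_row[i];
            float rstd = 1.0f / sqrtf(sumsq / C + eps);
            // reduction term: sum_i w_i * dy_i * x_i
            float dot = 0.0f;
            for (int i = 0; i < C; i++) dot += w[i] * dy_row[i] * x_row[i];
            for (int i = 0; i < C; i++) {
                dx_row[i] = rstd * (w[i] * dy_row[i] - x_row[i] * rstd * rstd * dot / C);
                dw[i] += dy_row[i] * x_row[i] * rstd;   // weight gradient
            }
        }
    }
    ```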
  3. a2b66f1 (commit message not shown)
  4. 102067f (commit message not shown)
  5. integrate our rmsnorm backward and move the other rmsnorm functions into rmsnorm.cuh, which is a new file
    karpathy committed Sep 26, 2024 (2c4b3cc)
  6. cbf53e3 (commit message not shown)
  7. Update RoPE naming
    insop committed Sep 26, 2024 (01c2895)

Commits on Sep 27, 2024

  1. 1b54612 (commit message not shown)
  2. Merge pull request #764 from insop/insop/llama3

    Adding backward kernel for repkv on `llama3` branch (cudamode-irl)
    karpathy authored Sep 27, 2024 (c8b348e)
  3. small fixes, but still not too happy with this kernel; it wastes threads, and a more efficient kernel2 implementation is desirable and desired
    karpathy committed Sep 27, 2024 (28e4a7f)
  4. just pushing what i have. it's epsilon away from working, sigh. basically at the point where the prints happen, gradients match. but once we backward through attention, rope and repkv, gradients don't match. attention hasn't changed so that can't be wrong (?), so it's either repkv or rope. i have to go slower and double check the backward pass of both of these in detail. also had to introduce one more buffer for backward
    karpathy committed Sep 27, 2024 (075e430)
  5. add backward kernel to dev/cuda for rope, to ensure correctness. but i mean, it's trivial. this can't possibly be the issue. it must be the repkv
    karpathy committed Sep 27, 2024 (8d49062)
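    The rope backward really is close to trivial: RoPE rotates each channel pair by an angle that depends only on the position and the channel, so the backward pass applies the same rotation with the angle negated (the transpose of the rotation matrix). A sketch under an assumed layout, with adjacent channels forming a pair within each head; not the kernel from the commit.

    ```
    // x and out are (B, T, NH, HD) flattened. Forward uses sign = +1.0f,
    // backward uses sign = -1.0f. theta_base is e.g. 500000.0f for Llama 3.
    __global__ void rope_rotate_kernel(float* out, const float* x,
                                       int B, int T, int NH, int HD,
                                       float theta_base, float sign) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pair
        if (idx >= B * T * NH * (HD / 2)) return;
        int pair = idx % (HD / 2);
        int head_row = idx / (HD / 2);            // which (b, t, head)
        int t = (head_row / NH) % T;              // position in the sequence
        float freq = powf(theta_base, -2.0f * pair / HD);
        float angle = sign * t * freq;
        float c = cosf(angle), s = sinf(angle);
        const float* xin = x + (size_t)head_row * HD + 2 * pair;
        float* xout = out + (size_t)head_row * HD + 2 * pair;
        float x0 = xin[0], x1 = xin[1];
        xout[0] = x0 * c - x1 * s;
        xout[1] = x0 * s + x1 * c;
    }
    ```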
  6. reshuffle repkv a bit, i wrote it from scratch. the kernel is still correct. repkv backward looks correct. rope backward is trivial so i don't see how it's not correct, and i also checked it. basically i'm really confused right now
    karpathy committed Sep 27, 2024 (7d945e9)

Commits on Oct 1, 2024

  1. fix bug with qkvr sizing, has to be 3*C. Credit to @ademeure for finding this bug and bringing light to darkness and order to chaos. A true warrior in the fight against entropy.
    karpathy committed Oct 1, 2024 (e6481b6)
  2. ok the full backward now shows max abs diff of 3e-3, except for the encoder backward (that's coming next). i think 3e-3 seems ok just inspecting the differences manually. probably this is correct. encoder backward next
    karpathy committed Oct 1, 2024 (9099a0a)
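    A small helper of the kind one would use for this check: report the maximum absolute difference between the C tensor and a reference buffer written out by the Python side. The name and signature are hypothetical, not code from the commit.

    ```
    #include <math.h>
    #include <stdio.h>
    #include <stddef.h>

    float max_abs_diff(const float* a, const float* b, size_t n, const char* label) {
        if (n == 0) return 0.0f;
        float maxdiff = 0.0f;
        size_t argmax = 0;
        for (size_t i = 0; i < n; i++) {
            float d = fabsf(a[i] - b[i]);
            if (d > maxdiff) { maxdiff = d; argmax = i; }
        }
        printf("%s: max abs diff %.3e at index %zu (%f vs %f)\n",
               label, maxdiff, argmax, a[argmax], b[argmax]);
        return maxdiff;
    }
    ```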
  3. take out debugging stuff. we can now run training loop for both models. they don't match yet
    karpathy committed Oct 1, 2024 (c746e06)
  4. BF16 opt state (m/v) with stochastic rounding, seems to work really well (OPTIMIZER_LOW_PRECISION=1)
    ademeure committed Oct 1, 2024 (2602b46)
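    Keeping the Adam m/v state in BF16 leans on stochastic rounding so the quantization error averages out over steps instead of systematically truncating small updates. A sketch of the rounding step; the real change is gated behind OPTIMIZER_LOW_PRECISION=1, and the random source here is left to the caller as an assumption.

    ```
    #include <cuda_bf16.h>

    // Round a float to bf16 stochastically: the low 16 bits of the fp32 value
    // are compared against a uniform random threshold, so the probability of
    // rounding up equals the discarded fraction. 'rnd' is a uniform 32-bit
    // random number supplied by the caller.
    __device__ __nv_bfloat16 stochastic_round_bf16(float x, unsigned int rnd) {
        unsigned int bits = __float_as_uint(x);
        unsigned int rounded = bits + (rnd & 0xFFFFu); // may carry into the kept bits
        rounded &= 0xFFFF0000u;                        // truncate to bf16 precision
        return __float2bfloat16_rn(__uint_as_float(rounded));
    }
    ```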
  5. Merge pull request #772 from ademeure/llama3_arun_new

    BF16 opt state (m/v) with stochastic rounding (Llama3 branch)
    karpathy authored Oct 1, 2024 (d808d78)
  6. 2c5ced6 (commit message not shown)