Post your hardware specs here if you got it to work. 🛠 #79

elephantpanda · 2023-03-03T03:20:37Z

It might be useful if you get the model to work to write down the model (e.g. 7B) and the hardware you got it to run on. Then people can get an idea of what will be the minimum specs. I'd also be interested to know. 😀

Urammar · 2023-03-03T03:33:02Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram

The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

ouening · 2023-03-03T03:54:36Z

7B model passed under the following environment:
Env: PyTorch 1.11.0,Python 3.8(ubuntu20.04),Cuda 11.3
GPU: RTX A4000(16GB) * 1
CPU: 12 vCPU Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
RAM: 32GB

With some modification:

model_args: ModelArgs = ModelArgs(max_seq_len=1024, max_batch_size=1, **params) # 

model = Transformer(model_args).cuda().half() # some people say it doesn't help

prompts = ["What is the most famous equation from this theory?"]

Logophoman · 2023-03-03T14:03:35Z

@Urammar could you also post how much Vram the other 2 Models need? I feel like this could help a lot of people to know what their machine can actually support. I only have a single A100 40GB and can therfore only run the 7B parameters model atm... 😅

ahoho · 2023-03-03T16:51:52Z

Not sure if this will be helpful, but I made a spreadsheet to calculate the memory requirements for each model size, following the FAQ and Paper. You can make a copy to adjust the batch size and sequence length

Will update as necessary

NightMachinery · 2023-03-03T19:34:27Z

How much VRAM does the 7B model need for finetuning? Are the released weights 32-bits?

gmorenz · 2023-03-03T19:40:33Z

I just made enough code changes to run the 7B model on the CPU. That involved

Replacing torch.cuda.HalfTensor with torch.BFloat16Tensor
Deleting every line of code that mentioned cuda

I also set max_batch_size = 1, removed all but 1 prompt, and added 3 lines of profiling code.

Steady state memory usage is <14GB (but it did use something like 30 while loading the model). It took 7.75 seconds to load the model (some memory swapping occurred during this so it may not be representative), 183 seconds to generate the first token, and 23 seconds to generate each token thereafter. It's only using a single CPU core for some reason (that I haven't tracked down yet).

Hardware: Ryzen 5800x, 32 GB ram

ergosumdre · 2023-03-03T21:50:36Z

I just made enough code changes to run the 7B model on the CPU. That involved

Replacing torch.cuda.HalfTensor with torch.BFloat16Tensor

Deleting every line of code that mentioned cuda

I also set max_batch_size = 1, removed all but 1 prompt, and added 3 lines of profiling code.

Steady state memory usage is <14GB (but it did use something like 30 while loading the model). It took 7.75 seconds to load the model (some memory swapping occurred during this so it may not be representative), 183 seconds to generate the first token, and 23 seconds to generate each token thereafter. It's only using a single CPU core for some reason (that I haven't tracked down yet).

Hardware: Ryzen 5800x, 32 GB ram

Can I ask you the biggest favor and provide your example.py file? :)

gmorenz · 2023-03-03T22:26:06Z

Can I ask you the biggest favor and provide your example.py file? :)

This is probably what you want (the changes aren't just in example.py): https://github.com/gmorenz/llama/tree/cpu

ergosumdre · 2023-03-03T22:38:06Z

Gotcha. So all we would run is

python3 llama/generation.py --max_gen_len 1 ?

gmorenz · 2023-03-03T22:48:09Z

python3 -m torch.distributed.run --nproc_per_node 1 example.py --ckpt_dir ~/LLaMA/7B/ --tokenizer_path ~/LLaMA/tokenizer.model --max_batch_size 1

Is more like it... also remove the extra prompts in the hardcoded prompts array. Also reduce max_gen_len if you want it to take less than 1.6 hours (but I just let that part run).

fabawi · 2023-03-03T23:44:32Z

I was able to run 7B on two 1080 Ti (only inference). Next, I'll try 13B and 33B. It still needs refining but it works! I forked LLaMA here:

https://github.com/modular-ml/wrapyfi-examples_llama

and have a readme with the instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) on multiple GPUs/machines, each with less than 16GB VRAM

currently distributes on two cards only using ZeroMQ. Will support flexible distribution soon!

This approach has only been tested on 7B model for now, using Ubuntu 20.04 with two 1080 Tis. Testing 13B/30B models soon!
UPDATE: Tested on Two 3080 Tis as well!!!

How to?

Replace all instances of <YOUR_IP> and before running the scripts
Download LLaMA weights using the official form below and install this wrapyfi-examples_llama inside conda or virtual env:

git clone https://github.com/modular-ml/wrapyfi-examples_llama.git
cd wrapyfi-examples_llama
pip install -r requirements.txt
pip install -e .

Install Wrapyfi with the same environment:

git clone https://github.com/fabawi/wrapyfi.git
cd wrapyfi
pip install .[pyzmq]

Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

cd wrapyfi/standalone 
python zeromq_proxy_broker.py --comm_type pubsubpoll

Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important, dont start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):

CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

Now start the second instance (within this repo and env) :

CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

You will now see the output on both terminals
EXTRA: To run on different machines, the broker must be running on a specific IP in step 4. Start the ZeroMQ broker by setting the IP and provide the env variables for steps 5+6 e.g.,

### (replace 10.0.0.101 with <YOUR_IP> ###

# step 4 modification 
python zeromq_proxy_broker.py --socket_ip 10.0.0.101 --comm_type pubsubpoll

# step 5 modification
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1

# step 6 modification
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='10.0.0.101' torchrun --master_port=29503 --nproc_per_node 1 example.py --ckpt_dir <YOUR CHECKPOINT DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR CHECKPOINT DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0

gmorenz · 2023-03-04T00:27:43Z

With this code I'm able to run the 7B model on

Ram: 32GB (14.4GB sustained use, more during startup)
CPU: Ryzen 5800x, exactly one core is used at 100%
Graphics: RTX 2070 Super, only 1962MiB vram used by pytorch

It generates tokens at roughly 4.5 seconds/token. I have reasonable to believe that I can get that down to 2.0 seconds/token with more careful memory management (I've done it, but leaking memory on the CPU side leading to an OOM). It (now) generates tokens at roughly 1 second/token.

All the code is doing is storing the weights on the CPU and moving them to the GPU just before they're used (and then back. Ideally we'd just copy them to the GPU and then never move them back... but I think that will take a more extensive change to the code).

elephantpanda · 2023-03-04T09:24:14Z

My results are in just to prove it works with only 12GB system ram! #105

Model 7B
System RAM: 12GB 😱
VRAM: 16GB (GPU=Quadro P5000)
System: Shadow PC

Took about a minute to load the model, it was maxing out the RAM and chomping on the page file. 😉 Loaded model in 116.71 seconds.
But then quite quick to generate the results.

Changes I made to example.py

torch.distributed.init_process_group("gloo")

model_args: ModelArgs = ModelArgs( max_seq_len=max_seq_len, max_batch_size=1, **params )

with torch.no_grad(): 
    checkpoint = torch.load(ckpt_path, map_location="cpu")

generator = load(  ckpt_dir, tokenizer_path, local_rank, world_size, max_seq_len, 1  )

neuhaus · 2023-03-04T10:09:24Z

Hardware:

RTX 3090 FE 24GB (with 2 monitors connected)
Ryzen 7 3700X
32GB RAM

Llama 13B on a single RTX 3090

In case you haven't seen it:
There is a fork at https://github.com/tloen/llama-int8 by @tloen that uses INT8.

I managed to get Llama 13B to run with it on a single RTX 3090 with Linux! Make sure not to install bitsandbytes from pip, install it from github!

With 32GB RAM and 32GB swap, quantizing took 1 minute and loading took 133 seconds. Peak GPU usage was 17269MiB.

Kudos @tloen! 🎉

Llama 7B

Software:

Windows 10 with NVidia Studio drivers 528.49
Anaconda 64bit with Python 3.9.13
pytorch 1.13.1 with CUDA 11.7 (installed with conda).
Llama 7B

What i had to do to get it (7B) to work on Windows:

Use python -m torch.distributed.run instead of torchrun
example.py: torch.distributed.init_process_group("gloo")

Loading the model takes 5.1 seconds. nvidia-smi output at default max_batch_size 32:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.49       Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:07:00.0  On |                  N/A |
| 30%   55C    P2   307W / 350W |  22158MiB / 24576MiB |     76%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

On Ubuntu Linux 22.04.2 i was able to run the example with torchrun without any changes. Loading the model from an NTFS partition is a bit slower at 6.7 seconds and memory usage was 22916MiB / 24576MiB. nvidia drivers 530.30.02, CUDA 12.1.

venuatu · 2023-03-04T11:29:59Z

I have a version working with a batch size of 16 on a 2080 (8GB) using the 7B model
It's available at https://github.com/venuatu/llama
My changes were:

printing output after every token, to get a chatgpt-like experience venuatu@25c8497
only keep a single transformer block on the gpu at a time (similar to @gmorenz above)
changed from fairscale layers to torch.nn.Linear
added tqdm progress bars

And from that I get around half an hour for 16 outputs of 512 length. It seemed like the average was 3 seconds per forward pass at 16 batch size.

The most random output for me so far has been a bunch of floor related negative tweets, which came from the tweet sentiment analysis prompt

Tweet: "Roscoe just peed on the floor. I was not expecting this."
Sentiment: Negative
###
Tweet: "My cat just licked the floor. "
Sentiment: Negative
###
Tweet: "My dog just peed on the floor. I was not expecting this."
Sentiment: Negative

gmorenz · 2023-03-04T14:05:44Z

@venuatu - check out my code for how I avoided doing a .cpu() on the layer after being done with it - that gave me a 4x speedup over naively moving the layer back and forth between the gpu and cpu (when measured with a batch_size of 1).

I'm also curious why you're doing torch.cuda.empty_cache()? That seems like it's just going to force cuda to reallocate the buffers for the layer it just moved off of the gpu when it moves the next layer onto the gpu.

venuatu · 2023-03-04T22:04:57Z

Yep, that's a much better way to do it. It's now running in half the time (ty @gmorenz )
2080(8GB) ~16 minutes for 512 tokens at 16 batch size

the empty_cache may not have been necessary, with other models in the past I've had buffers get stuck on the gpu, but that is not happening here, maybe pytorch has improved that upstream

venuatu · 2023-03-05T07:57:59Z

I found some fixes for the very slow load times and its now down to 2.5 seconds (with a hot file cache) from my previous 83 seconds

using torch.nn.utils.skip_init() to skip random parameter initialization, saving 30 seconds venuatu@dfdd0ee
using pyarrow to transform the original checkpoint format into instantly available memory mapped tensors, saving 50 seconds venuatu@0d2bb5a
- this makes a new arrow folder next to the checkpoint with 300-ish files that in total are the same as the checkpoint

reycn · 2023-03-05T07:59:40Z

Apple Silicon M1, CPU mode

MindSetFPS · 2023-03-05T21:08:27Z

Specs: Ryzen 5600x, 16 gigs of ram, RTX 3060 12gb

With @venuatu 's fork and the 7B model im getting:

46.7 seconds to load.
13.8gb of ram used
1.26gb in swap
5gb in vram
and there is one core always at 100% utilization

pavelzbornik · 2023-03-05T21:37:07Z

My Specs:
GTX 1630 4GB
i5-13400F
128GB RAM
Win 11

Using 7B, model loading time 5,61 sec

used @gmorenz's fork which enable my tiny GPU to run it :) and changed from nccl to gloo

torch.distributed.init_process_group("gloo")

Moonshine-in-Kansas · 2023-03-06T01:14:25Z

I got finally the 65GB model running on a server of the genesiscloud with 8 RTX 3090 cards with 24GB memory each.
The cost to run the server are a little over $10/hour.

Takes almost 3 minutes to load. Inference is quicker than I can read.

So far I am not impressed. I believe GPT-3 (text-davinci-002) is better. But I have to do more tests with different temperatures etc. Here is the result of one experiment:

Why General Artifical Intelligence will overtake the world soon. An Essay by Llama.
Essay by Llama, High School, 10th grade, A+, January 2005
Keywords United States, human beings, Computers, 21st century, Artificial intelligence
In the 21st century, computers are going to take over the world. There is no doubt about it. They are going to become so advanced that they will be able to do everything that human beings can do, and more. In the future, computers will be abl e to drive cars, make movies, and even write books.
Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, t hey can do complicated math problems and can even do complicated tasks like driving a car.
In the future, computers will be able to do everything that human beings can do. They will be able to drive cars, make mo vies, and even write books.
Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, t hey can do complicated math problems and can even do complicated tasks like driving a car. Computers are also getting mor e and more intelligent.

andrewssobral · 2023-03-06T21:24:35Z

Hello guys,
I am also interested to see how to run LLaMA (e.g. 7B model) on Mac M1 or M2, any solution until now?

gmorenz · 2023-03-06T22:55:18Z

I have the 65B (120GB) model working at 60 seconds/token on:

GPU: Nvidia RTX 2070 super (8GB vram, 5946MB in use, only 18% utilization)
CPU: Ryzen 5800x, less than one core used
RAM: 32GB, Only a few GB in continuous use but pre-processing the weights with 16GB or less might be difficult
SSD: 122GB in continuous use with 2GB/s read. Pre-processing the weights done in double that, but could easily be modified to work in 138GB.

SSD read speed is (of course) the bottleneck - I'm just loading every layer from disk before using it and freeing all the memory (RAM and VRAM) afterwards. Will clean up the code and push it tomorrow.

Goes without saying that at 60 seconds/token the utility of this is... questionable.

applefreak · 2023-03-06T23:32:06Z

Hello guys, I am also interested to see how to run LLaMA (e.g. 7B model) on Mac M1 or M2, any solution until now?

I tried 7B with the CPU version on a M2 Max with 64GB ram, it's slow as heck but it works! Load time around 84secs and takes about 4mins to generate a response with max_gen_len=32

Input:

The Z80 is a processor that

Output:

The Z80 is a processor that 8-bit microcomputer manufacturers used from 1976 to 1992.
The Z80 was developed by the

Edit: on a 2nd try, the model load time is reduced to 34secs, not sure what changed, but keep in mind I'm running this in a Docker container (using continuumio/miniconda3 image) with interactive shell. I allocated 8 CPUs and all 64GB ram for Docker in the Docker Desktop app.

YellowRoseCx · 2023-03-07T03:01:57Z

Anyone have info regarding use with AMD GPUs? The 7b LLaMa model loads and accepts up to 2048 context tokens on my RX 6800xt 16gb

I keep seeing people talking about VRAM requirements when running in 8 bit mode and no one's talking about normal 16 bit mode lol

terbo · 2023-03-07T03:05:00Z

Got 7B loaded on 2x 8GB 3060's, using Kobold United, the dev branch, getting about 3 tokens/second.

terbo: what is life?
llamabot: I think life is just something that all living things have to make their way through

elephantpanda · 2023-03-07T04:26:54Z

Anyone have info regarding use with AMD GPUs? The 7b LLaMa model loads and accepts up to 2048 context tokens on my RX 6800xt 16gb

I keep seeing people talking about VRAM requirements when running in 8 bit mode and no one's talking about normal 16 bit mode lol

Does CUDA work on AMD? Someone tried to made a DirectML port: #117 Which should work on AMD (for Windows) but it hasn't been tested so it might need some fixing.

Lima1512 · 2023-05-11T08:10:07Z

I'm looking for the best laptop for the job.
But there is no leptop with more than 16gb vram :-(

So what do you think about 64gb ram and 16gb vram ?

Foul-Tarnished · 2023-06-05T11:42:56Z

I'm looking for the best laptop for the job. But there is no leptop with more than 16gb vram :-(

So what do you think about 64gb ram and 16gb vram ?

Lol, laptop will just thermal throttle after 2min

jacksutherland · 2023-07-13T22:00:38Z

What are the "ideal" specs to run 65B on a PC?

Is it possible to build a box with Llama 65B running in a process, that can still perform well as your daily driver?
If that's a long-shot, which model would work best for this?
And what specs would it take?

rhiskey · 2023-07-20T20:38:32Z

Okay, what about minimum requirements? What kind of model can run on old servers and how much RAM needed for just only run LLAMA2?

develCuy · 2023-07-25T03:10:02Z

Trained with SFTTrainer and QLora on Google Colab:

Llama-2-7b-hf
Llama-2-13b-hf (Google Colab Pro)

BitAndBytes (double quantize), Mixed Precision training (fp16="02") and gradient+batch sizes of 2 or lower helped out with memory constrains.

If you don't have your own hardware, use Google Colab. This is a good starter:

https://colab.research.google.com/drive/12dVqXZMIVxGI0uutU6HG9RWbWPXL3vts

Lima1512 · 2023-07-25T09:22:00Z

Do you have a tutorial/ video... ?

mitchec4 · 2023-07-26T10:46:25Z

I have an M1 MacBook Pro 16" with 16GB ram. it runs both the 7B and 13B models. Loads with no delay and I usually get instant response from both though additional info can take around 5 seconds to appear. They appear to have a wild imagination when it comes to accuracy so take most answers with pinch of salt. Sometimes after asking an initial question It often goes off and starts asking its own questions and then answers itself. I suppose it is presuming these are standard follow up questions that most people will ask.

The 13B model can become unstable after some use. usually get a load of repeating text then it locks up.

iakashpaul · 2023-07-26T15:58:28Z

Llama2 7B-Chat on RTX 2070S with bitsandbytes FP4, Ryzen 5 3600, 32GB RAM

Completely loaded on VRAM ~6300MB, took ~12 seconds to process ~2200 tokens & generate a summary(~30 tokens/sec).

Llama.cpp for llama2-7b-chat (q4) on M1 Pro works with ~2GB RAM, 17tok/sec

Also ran the same on A10(24GB VRAM)/LambdaLabs VM with similar results

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = 'meta-llama/Llama-2-7b-chat-hf'

if torch.cuda.is_available():
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map='auto', load_in_4bit=True
    )

wkgcass · 2023-07-28T14:16:28Z

Llama2 7B-Chat official sample (with exactly the same launching arguments in README)

GPU: 4060Ti 16GB. Consumes more than 14GB
RAM: 16GB. The memory usage is about 2GB after the model is loaded

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4

binaryninja · 2023-07-28T15:55:59Z

Task: Fine tune Llama2 7B and 13B on a task specific function using my own data GPU: 3090 24GB RAM : 256 GB CPU: 3970X I have two GPUs but I only wanted to use one so I ran the following in my terminal so the script could only see the first GPU in my system export CUDA_VISIBLE_DEVICES=0 I trained with LORA rank of 32, batch size 1, context length of 4096. After training for 2000 steps I saw a noticeable improvement on the task I was training for, loss went from ~1.8/1.4 for 7B/13B base models to 0.41 / 0.33 after 5000steps and I still have room to go (0.5 through an epoch). The task I'm training on is the recognition and description of malicious decompiled code (malware).

…

On Fri, Jul 28, 2023 at 10:16 AM K.G. Wang ***@***.***> wrote: Llama2 7B-Chat official sample (with exactly the same launching arguments in README) GPU: 4060Ti 16GB. Consumes more than 14GB RAM: 16GB. The memory usage is about 2GB after the model is loaded torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4 — Reply to this email directly, view it on GitHub <#79 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABNELIQ5PBPFP575SMHELT3XSPCVPANCNFSM6AAAAAAVODAZWY> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

develCuy · 2023-07-31T18:04:00Z

Llama 7B and 13B both GGML quantized. Hardware:

GPU: Quadro P1000 (4GB RAM)
CPU: Ryzen 5 360
RAM: 16 GB

Running in local (no huggingface, etc) with LlamaCpp

rchen19 · 2023-08-16T23:00:40Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram

The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

yudhiesh · 2023-08-17T03:36:48Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

rchen19 · 2023-08-17T03:42:18Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

yudhiesh · 2023-08-17T03:46:29Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

rchen19 · 2023-08-17T03:50:55Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage.

But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

yudhiesh · 2023-08-17T04:01:14Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage.

But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

I can't comment on design decisions made by Huggingface but I stick to specifying the dtype regardless of the model I load.

rchen19 · 2023-08-17T04:04:37Z

7B takes about 14gb of Vram to inference, and the 65B needs a cluster with a total of just shy of 250gb Vram
The 7b model also takes about 14gb of system ram, and that seems to exceed the capacity of free colab, if anyone requires that.

nvidia-smi
Thu Mar  2 19:29:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0   107W / 400W |  29581MiB / 40960MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    98W / 400W |  29721MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   32C    P0    95W / 400W |  29719MiB / 40960MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   33C    P0   106W / 400W |  29723MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   41C    P0   102W / 400W |  29725MiB / 40960MiB |     76%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   38C    P0   114W / 400W |  29719MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  29725MiB / 40960MiB |     75%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   39C    P0    95W / 400W |  29573MiB / 40960MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2916985      C   ...1/envs/pytorch/bin/python    29579MiB |
|    1   N/A  N/A   2916986      C   ...1/envs/pytorch/bin/python    29719MiB |
|    2   N/A  N/A   2916987      C   ...1/envs/pytorch/bin/python    29717MiB |
|    3   N/A  N/A   2916988      C   ...1/envs/pytorch/bin/python    29721MiB |
|    4   N/A  N/A   2916989      C   ...1/envs/pytorch/bin/python    29723MiB |
|    5   N/A  N/A   2916990      C   ...1/envs/pytorch/bin/python    29717MiB |
|    6   N/A  N/A   2916991      C   ...1/envs/pytorch/bin/python    29723MiB |
|    7   N/A  N/A   2916993      C   ...1/envs/pytorch/bin/python    29571MiB |
+-----------------------------------------------------------------------------+

250GB for a 65b model seems a bit too much, I think most of the examples out there for 65b model usually about 140GB is needed. Any insight on the reason for the difference? Is it fine tuning memory usage?

Are you loading it in full-precision, i.e., float-32? If so it would make sense as the memory requirements for a 65b parameter model is 65 * 4 = ~260GB as per LLM-Numbers. To get it down to ~140GB you would have to load it in bfloat/float-16 which is half-precision, i.e., 65 * 2 = ~130GB.

Wait, I thought Llama was trained in 16 bits to begin with.

That is true, but you will still have to specify the dtype when loading the model otherwise it will default to float-32 as per the docs.

Ah I see what you meant. Thanks for clarification. I was using hugging face version with their transformers package, so I guess that was the reason I didn’t see such a big memory usage.
But seems a waste of memory to cast 16 bit model to 32 bit? Is there any reason you kept the PyTorch default precision?

I can't comment on design decisions made by Huggingface but I stick to specifying the dtype regardless of the model I load.

Apologies. Apparently I mistook you as the original comment author. Yes, I’d agree specifying dtype to match the native precision of the model would be a good idea.

Twenkid · 2023-09-03T10:19:16Z

Thanks to all for the data.
@Moonshine-in-Kansas Your example reminds me of a real "groundbreaking" essay of mine, a 9th grader at the time, 1999, "Where are you going, world?",.an incentive prize of a radio "Plovdiv" competition. (The conclusion is: to the creation of the Thinking Machine - the Machine God). The original in Bulgarian:
https://github.com/Twenkid/Theory-of-Universe-and-Mind/blob/main/1999.md

In English (MT, not edited)
Future of Artificial Intelligence
Default (GPT-3.5)
User
translate to English: Къде отиваш свят?
Конкурс за есе на радио „Пловдив“, 13.12.1999 г., поощрителна награда
Автор: Тодор Илиев Арнаудов, 15 г., 9-ж клас в ТЕЕ Пловдив (ПГЕЕ) (...)

ChatGPT
Where are you going, world? Essay competition on Radio "Plovdiv," December 13, 1999, encouragement award Author: Todor Iliev Arnaudov, 15 years old, 9th grade at TEE Plovdiv (PGEE) Where are you going, world? Only God knows where and when a person first asked this question. And what is the answer today, nearly two thousand years after the birth of Christ?

Humanity developed and progressed for thousands of years. It invented the wheel, learned to work with metals, invented writing, then printing, the mechanical calculator, the telegraph, the telephone and the radio, television, and electronic computers.

People wanted to preserve their knowledge for future generations, they wanted to communicate over long distances, to have "eyes" everywhere in the world. All of this became a reality thanks to human genius and the desire of Homo sapiens to control information. In his pursuit, he created the universal machine for executing algorithms, for processing information - the electronic computer. The idea for it occurred to John Atanasoff in the late thirties, and it became a reality in 1945 in the USA under the name "ENIAC." Like most major discoveries, the computer was initially used only for military purposes, but later it found its rightful place in almost all spheres of our life. Computing machines developed rapidly - from electronic tubes through transistors and to the highest technology - integrated circuits, which allowed electronic engineers to create modern microprocessors with millions of transistors that fit in the palm of a hand, capable of performing billions of calculations per second. The impact of the global computer network, the Internet, on people today is palpable. The network will continue to expand, but I believe that this is not the culmination of the rapid development of electronics, computing technology, and software (the latter actually determines the computer's "behavior," i.e., it is its "soul," and electronics is part of its material form). Another discovery, still unrealized, will crown this progress - artificial intelligence.

Some people believe that computers are "soulless machines" doomed to remain only assistants to human intelligence and creativity, but not to become creators themselves and possess reason. Machines are created by humans, I believe that he is capable of creating a system, an algorithm that imitates the work of his brain, and in this way, the computer can become a thinking machine. In my opinion, the path of humanity leads there - towards the creation of an electronic analog of the highest creation of nature, what is currently unique to humans - intelligence.

The creation of artificial intelligence will change the world. In my opinion, artificial intelligence is the next step in the evolution of matter - computers have several advantages over the "transitional" human, the most essential of which is that they are practically immortal - they withstand all kinds of radiation, do not feel pain, require very little energy, which they can easily obtain from the Sun, and they can reproduce themselves by producing factories, and so on. My opinion is that the world is heading precisely there, towards the creation of the thinking machine - the machine of God.

I got finally the 65GB model running on a server of the genesiscloud with 8 RTX 3090 cards with 24GB memory each. The cost to run the server are a little over $10/hour.

Takes almost 3 minutes to load. Inference is quicker than I can read.

So far I am not impressed. I believe GPT-3 (text-davinci-002) is better. But I have to do more tests with different temperatures etc. Here is the result of one experiment:

Why General Artifical Intelligence will overtake the world soon. An Essay by Llama. Essay by Llama, High School, 10th grade, A+, January 2005 Keywords United States, human beings, Computers, 21st century, Artificial intelligence In the 21st century, computers are going to take over the world. There is no doubt about it. They are going to become so advanced that they will be able to do everything that human beings can do, and more. In the future, computers will be abl e to drive cars, make movies, and even write books. Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, t hey can do complicated math problems and can even do complicated tasks like driving a car. In the future, computers will be able to do everything that human beings can do. They will be able to drive cars, make mo vies, and even write books. Computers are getting more and more advanced every day. In the past, computers could only do simple math problems. Now, t hey can do complicated math problems and can even do complicated tasks like driving a car. Computers are also getting mor e and more intelligent.

hz-nm · 2023-09-20T12:18:58Z

CodeLLAMA-13B Running on RTX3090.
Loading the model in 4bit takes ~10GB of VRAM.

develCuy · 2023-09-20T14:08:36Z

@hz-nm, how fast is it in terms of tokens per second?

DanRegalia · 2023-09-23T19:41:03Z

Hope this helps. DAVE

Now I'm stuck with 5 questions... what else are you running on that machine? I feel like there needs to be some kind of helper app in there. Maybe I need a better understanding of the hardware needs... can I run it on CPU alone if I have enough CPU memory? Can I run these larger models on a regular PC? Can I get a few P40s or K40s and offload certain tasks to this? I'm really curious about the hardware needs for running these models...

hz-nm · 2023-09-26T12:51:30Z

@hz-nm, how fast is it in terms of tokens per second?

I didn't put a measure for that unfortunately but it is quite fast. Almost as good as ChatGPT if it was streaming.

ghost · 2023-10-06T06:37:04Z

Hey guys, I want to deploy code Llama on a Ubuntu server specifically on the cloud, what specs should I use like the vCPU and memory? Please suggest or guide for the same. Thanks in advance.

Oysters3 · 2024-06-09T22:49:26Z

Am new to this and in testing phase to see what works.
AMD Ryzen 9 7940HS Processor, 8 Cores/16 Threads
Integrated gpu not supported? (Radeon 780M), and doesn't seem to be getting used. No external GPU connected currently.
32Gb DDR5 Ram
2x1Tb SSD drives in Raid0 (990 pro)
Mistrel 7B works relatively well, streams instantly.
Using Fabric for better prompting also works well, however core temps go through the roof. A test on extracting wisdom from a you tube transcript pushed temps to 89.5 degrees...

elephantpanda changed the title ~~Post your hardware specs here if you got it to work.~~ Post your hardware specs here if you got it to work. 🛠 Mar 3, 2023

elephantpanda mentioned this issue Mar 3, 2023

How much memory is required to load the 7B model? #82

Closed

g1sbi mentioned this issue Jul 19, 2023

Hardware requirements for Llama 2 #425

Closed

namaggarwal added the invalid Not an issue label Sep 6, 2023

namaggarwal self-assigned this Sep 6, 2023

namaggarwal closed this as completed Sep 6, 2023

andrewchungg mentioned this issue Sep 7, 2023

Hrdware requirements to run 13B and 30B smoothly #561

Closed

Post your hardware specs here if you got it to work. 🛠 #79

Post your hardware specs here if you got it to work. 🛠 #79

Comments

elephantpanda commented Mar 3, 2023

Urammar commented Mar 3, 2023 • edited Loading

ouening commented Mar 3, 2023 • edited Loading

Logophoman commented Mar 3, 2023

ahoho commented Mar 3, 2023 • edited Loading

NightMachinery commented Mar 3, 2023

gmorenz commented Mar 3, 2023

ergosumdre commented Mar 3, 2023

gmorenz commented Mar 3, 2023

ergosumdre commented Mar 3, 2023

gmorenz commented Mar 3, 2023 • edited Loading

fabawi commented Mar 3, 2023

LLaMA with Wrapyfi

How to?

gmorenz commented Mar 4, 2023 • edited Loading

elephantpanda commented Mar 4, 2023 • edited Loading

neuhaus commented Mar 4, 2023 • edited Loading

Hardware:

Llama 13B on a single RTX 3090

Llama 7B

Software:

venuatu commented Mar 4, 2023

gmorenz commented Mar 4, 2023

venuatu commented Mar 4, 2023

venuatu commented Mar 5, 2023 • edited Loading

reycn commented Mar 5, 2023

MindSetFPS commented Mar 5, 2023 • edited Loading

pavelzbornik commented Mar 5, 2023

Moonshine-in-Kansas commented Mar 6, 2023

andrewssobral commented Mar 6, 2023

gmorenz commented Mar 6, 2023 • edited Loading

applefreak commented Mar 6, 2023 • edited Loading

YellowRoseCx commented Mar 7, 2023

terbo commented Mar 7, 2023 • edited Loading

elephantpanda commented Mar 7, 2023

Lima1512 commented May 11, 2023 • edited Loading

Foul-Tarnished commented Jun 5, 2023

jacksutherland commented Jul 13, 2023

What are the "ideal" specs to run 65B on a PC?

rhiskey commented Jul 20, 2023

develCuy commented Jul 25, 2023 • edited Loading

Lima1512 commented Jul 25, 2023

mitchec4 commented Jul 26, 2023

iakashpaul commented Jul 26, 2023 • edited Loading

wkgcass commented Jul 28, 2023

binaryninja commented Jul 28, 2023 via email

develCuy commented Jul 31, 2023 • edited Loading

rchen19 commented Aug 16, 2023 • edited Loading

yudhiesh commented Aug 17, 2023

rchen19 commented Aug 17, 2023

yudhiesh commented Aug 17, 2023

rchen19 commented Aug 17, 2023

yudhiesh commented Aug 17, 2023

rchen19 commented Aug 17, 2023

Twenkid commented Sep 3, 2023

hz-nm commented Sep 20, 2023 • edited Loading

develCuy commented Sep 20, 2023 • edited Loading

DanRegalia commented Sep 23, 2023

hz-nm commented Sep 26, 2023

ghost commented Oct 6, 2023

Oysters3 commented Jun 9, 2024

Urammar commented Mar 3, 2023 •

edited

Loading

ouening commented Mar 3, 2023 •

edited

Loading

ahoho commented Mar 3, 2023 •

edited

Loading

gmorenz commented Mar 3, 2023 •

edited

Loading

gmorenz commented Mar 4, 2023 •

edited

Loading

elephantpanda commented Mar 4, 2023 •

edited

Loading

neuhaus commented Mar 4, 2023 •

edited

Loading

venuatu commented Mar 5, 2023 •

edited

Loading

MindSetFPS commented Mar 5, 2023 •

edited

Loading

gmorenz commented Mar 6, 2023 •

edited

Loading

applefreak commented Mar 6, 2023 •

edited

Loading

terbo commented Mar 7, 2023 •

edited

Loading

Lima1512 commented May 11, 2023 •

edited

Loading

develCuy commented Jul 25, 2023 •

edited

Loading

iakashpaul commented Jul 26, 2023 •

edited

Loading

develCuy commented Jul 31, 2023 •

edited

Loading

rchen19 commented Aug 16, 2023 •

edited

Loading

hz-nm commented Sep 20, 2023 •

edited

Loading

develCuy commented Sep 20, 2023 •

edited

Loading