-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA out of memory when quantizing llama3.1-405b on 80GiBx8 H100 instance #36
Comments
code is below
|
Hey @sfc-gh-zhwang AutoFP8 doesn't have support for quantizing a model that big if you can't fit the whole model in one node memory. We recommend using llm-compressor for this, here is an example https://huggingface.co/neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic#creation import torch
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import ( # noqa
calculate_offload_device_map,
custom_offload_device_map,
)
recipe = """
quant_stage:
quant_modifiers:
QuantizationModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 8
type: float
strategy: channel
dynamic: false
symmetric: true
input_activations:
num_bits: 8
type: float
strategy: token
dynamic: true
symmetric: true
targets: ["Linear"]
"""
model_stub = "meta-llama/Meta-Llama-3.1-405B-Instruct"
model_name = model_stub.split("/")[-1]
device_map = calculate_offload_device_map(
model_stub, reserve_for_hessians=False, num_gpus=8, torch_dtype=torch.float16
)
model = SparseAutoModelForCausalLM.from_pretrained(
model_stub, torch_dtype=torch.float16, device_map=device_map
)
output_dir = f"./{model_name}-FP8-dynamic"
oneshot(
model=model,
recipe=recipe,
output_dir=output_dir,
save_compressed=True,
tokenizer=AutoTokenizer.from_pretrained(model_stub),
) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The text was updated successfully, but these errors were encountered: