Llama3-8B finetuning shows runtime error of TDRV:v2_cc_execute #658
Comments
It should be fixed on |
The script can run with the Neuron SDK 2.19.1 and the optimum-neuron
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
System Info
gives the following error.
Launch an instance running Amazon Linux 2023.
Install the dependencies using the following script:
# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB
# Update OS packages
sudo yum update -y
# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y
# Install git
sudo yum install git -y
# Install Neuron Driver
sudo yum install aws-neuronx-dkms-2.* -y
# Install Neuron Runtime
sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y
# Install Neuron Tools
sudo yum install aws-neuronx-tools-2.* -y
# Create python3 venv
sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch
# Activate venv
source ~/aws_neuron_venv_pytorch/bin/activate
python -m pip install -U pip
# Install Jupyter notebook kernel
pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Install wget, awscli
python -m pip install wget
python -m pip install awscli
# Install Neuron Compiler and Framework
python -m pip install neuronx-cc==2.* torch-neuronx torchvision
# Install optimum-neuron
pip3 install --upgrade-strategy eager optimum[neuronx]
# Download scripts
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron/notebooks/text-generation/
# Log in with your Hugging Face token to download gated models
huggingface-cli login --token YOUR_TOKEN
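Because the comments on this issue tie the fix to specific Neuron SDK and optimum-neuron releases, it may help to record which versions the install script actually pulled in. The following is a minimal sketch using only the standard library. The helper name `report_versions` is hypothetical, and the package list is an assumption based on the packages installed above (with `optimum-neuron` assumed to be the distribution installed by the `optimum[neuronx]` extra).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from the install steps above; adjust as needed.
PACKAGES = ("neuronx-cc", "torch-neuronx", "torchvision", "optimum-neuron")


def report_versions(packages=PACKAGES):
    """Return a {package: version} mapping, marking missing packages."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}: {ver}")
```

Running this inside the activated virtual environment prints one line per package, which can be pasted into the System Info section of a report.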
Create a Python 3 file download_data.py under the directory optimum-neuron/notebooks/text-generation/ to download and process the dataset:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))
from transformers import AutoTokenizer

# Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B" # gated
# Note: this second assignment overrides the one above, so the Llama 2 tokenizer is the one that gets loaded
model_id = "meta-llama/Llama-2-7b-hf" # gated
tokenizer = AutoTokenizer.from_pretrained(model_id)

from random import randint
# add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils") # make sure you change this to the correct path
from pack_dataset import pack_dataset
# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive on both ends, hence the - 1)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048) # We use 2048 as the maximum length for packing

# save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
Run the above script:
python download_data.py
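The `pack_dataset` helper imported by the script lives in the repository's `scripts/utils` directory and is not reproduced in this report. As a rough illustration of what packing to a fixed `chunk_length` generally means, the sketch below concatenates tokenized samples and slices them into equal-length chunks. The function `pack_token_ids` is hypothetical and is not the repository's implementation; dropping the trailing remainder is an assumption made for simplicity.

```python
from itertools import chain


def pack_token_ids(samples, chunk_length):
    """Concatenate token-id lists and split them into fixed-size chunks.

    Hypothetical illustration only: trailing tokens that do not fill a
    whole chunk are dropped (an assumption, not the repository's behavior).
    """
    flat = list(chain.from_iterable(samples))
    usable = (len(flat) // chunk_length) * chunk_length
    return [flat[i:i + chunk_length] for i in range(0, usable, chunk_length)]


if __name__ == "__main__":
    toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_token_ids(toy, chunk_length=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With `chunk_length=2048`, as in the script, every training example ends up with the same sequence length, which keeps tensor shapes fixed across steps.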
Compile the finetuning script on inf2.8xlarge with the compile_llama3.sh script:
MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
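The compile command above and the fine-tuning run that follows share most of their arguments. Assuming the precompiled graphs are only reused when the model and shape-related settings match, one way to keep the two invocations in sync is to generate both from a single argument list. This is a sketch only: the helper `build_command` and the `SHARED_ARGS` split are hypothetical, and the flag values are copied from this report.

```python
import shlex

# Arguments common to the compile and training invocations in this report.
SHARED_ARGS = [
    "--model_id", "meta-llama/Meta-Llama-3-8B",
    "--dataset_path", "tokenized_dolly",
    "--bf16", "True",
    "--learning_rate", "5e-5",
    "--output_dir", "dolly_llama",
    "--overwrite_output_dir", "True",
    "--per_device_train_batch_size", "1",
    "--gradient_checkpointing", "True",
    "--tensor_parallel_size", "8",
    "--logging_steps", "10",
    "--gradient_accumulation_steps", "16",
]


def build_command(compile_only, nproc=8):
    """Return the argv list for either the compile pass or the real run."""
    launcher = ["torchrun", f"--nproc_per_node={nproc}", "scripts/run_clm.py"]
    if compile_only:
        # Compile pass: wrapped in neuron_parallel_compile, only a few steps.
        prefix = ["neuron_parallel_compile"]
        extra = ["--max_steps", "10"]
    else:
        # Real run: full training, flags as given in this report.
        prefix = []
        extra = ["--skip_cache_push", "True", "--num_train_epochs", "3"]
    return prefix + launcher + SHARED_ARGS + extra


if __name__ == "__main__":
    print(shlex.join(build_command(compile_only=True)))
    print(shlex.join(build_command(compile_only=False)))
```

The `MALLOC_ARENA_MAX=64` setting is an environment variable rather than an argument, so it would be supplied separately, for example through the `env` parameter of `subprocess.run`.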
Run the finetuning on inf2.8xlarge with the run_llama3.sh script:
MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
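For reference, the batch-related flags above combine as follows under the usual convention that the data-parallel degree equals the number of processes divided by the tensor-parallel size. That convention and the helper name `effective_batch` are assumptions; the values are taken from the command above and from the `chunk_length=2048` used when packing the dataset.

```python
def effective_batch(world_size, tensor_parallel_size, per_device_batch,
                    grad_accum_steps, seq_len):
    """Compute sequences and tokens consumed per optimizer step.

    Assumes data_parallel = world_size // tensor_parallel_size.
    """
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world_size must be divisible by tensor_parallel_size")
    data_parallel = world_size // tensor_parallel_size
    sequences = data_parallel * per_device_batch * grad_accum_steps
    return {
        "data_parallel": data_parallel,
        "sequences_per_step": sequences,
        "tokens_per_step": sequences * seq_len,
    }


if __name__ == "__main__":
    # Values from the run command: 8 processes, TP=8, batch 1, accumulation 16.
    print(effective_batch(8, 8, 1, 16, 2048))
```

With these settings all eight processes form a single tensor-parallel group, so each optimizer step covers 16 packed sequences of 2048 tokens.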