OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized. #1716

Closed
dreamkillers666 opened this issue Dec 3, 2023 · 5 comments
Labels
solved This problem has been already solved

Comments

@dreamkillers666

D:\softwares\Anaconda3\envs\zh\python.exe E:\ZH\PycharmProjects\LLaMA-Factory\src\train_web.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin D:\softwares\Anaconda3\envs\zh\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
CUDA SETUP: CUDA runtime path found: D:\softwares\Anaconda3\envs\zh\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary D:\softwares\Anaconda3\envs\zh\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
D:\softwares\Anaconda3\envs\zh\lib\site-packages\transformers\deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
D:\softwares\Anaconda3\envs\zh\lib\site-packages\trl\trainer\ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().
[INFO|training_args.py:1332] 2023-12-03 16:39:01,601 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1764] 2023-12-03 16:39:01,601 >> PyTorch: setting up devices
12/03/2023 16:39:01 - WARNING - llmtuner.model.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
12/03/2023 16:39:01 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: torch.float16
12/03/2023 16:39:01 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=True,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=saves\Baichuan-13B-Base\lora\baichuan-lora-sft\runs\Dec03_16-39-01_DESKTOP-VV27F79,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=adamw_torch,
optim_args=None,
output_dir=saves\Baichuan-13B-Base\lora\baichuan-lora-sft,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=saves\Baichuan-13B-Base\lora\baichuan-lora-sft,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
D:\softwares\Anaconda3\envs\zh\lib\site-packages\transformers\training_args.py:1677: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(
12/03/2023 16:39:01 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
12/03/2023 16:39:01 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\alpaca_gpt4_data_zh.json.
Using custom data configuration default-2e9a277fc6a494f2
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
12/03/2023 16:39:03 - INFO - llmtuner.data.loader - Loading dataset self_cognition.json...
12/03/2023 16:39:03 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\self_cognition.json.
Using custom data configuration default-e1b11482e194d36c
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
12/03/2023 16:39:04 - INFO - llmtuner.data.loader - Loading dataset sharegpt_zh_27k.json...
12/03/2023 16:39:04 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\sharegpt_zh_27k.json.
Using custom data configuration default-fc66647fa583c5bc
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Some of the datasets have disparate format. Resetting the format of the concatenated dataset.
[INFO|tokenization_utils_base.py:1852] 2023-12-03 16:39:06,843 >> loading file tokenizer.model from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\tokenizer.model
[INFO|tokenization_utils_base.py:1852] 2023-12-03 16:39:06,843 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1852] 2023-12-03 16:39:06,843 >> loading file special_tokens_map.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\special_tokens_map.json
[INFO|tokenization_utils_base.py:1852] 2023-12-03 16:39:06,843 >> loading file tokenizer_config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\tokenizer_config.json
[INFO|configuration_utils.py:715] 2023-12-03 16:39:07,407 >> loading configuration file config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\config.json
[INFO|configuration_utils.py:715] 2023-12-03 16:39:07,877 >> loading configuration file config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\config.json
[INFO|configuration_utils.py:775] 2023-12-03 16:39:07,878 >> Model config BaichuanConfig {
"_from_model_config": true,
"_name_or_path": "baichuan-inc/Baichuan-13B-Base",
"architectures": [
"BaichuanForCausalLM"
],
"auto_map": {
"AutoConfig": "baichuan-inc/Baichuan-13B-Base--configuration_baichuan.BaichuanConfig",
"AutoModelForCausalLM": "baichuan-inc/Baichuan-13B-Base--modeling_baichuan.BaichuanForCausalLM"
},
"bos_token_id": 1,
"eos_token_id": 2,
"gradient_checkpointing": [
false
],
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13696,
"model_max_length": 4096,
"model_type": "baichuan",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.33.0",
"use_cache": true,
"vocab_size": 64000
}

12/03/2023 16:39:07 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2857] 2023-12-03 16:39:12,000 >> loading weights file pytorch_model.bin from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\pytorch_model.bin.index.json
[INFO|modeling_utils.py:1200] 2023-12-03 16:39:12,006 >> Instantiating BaichuanForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:768] 2023-12-03 16:39:12,009 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.33.0"
}

[INFO|modeling_utils.py:2971] 2023-12-03 16:39:13,563 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|██████████| 3/3 [00:40<00:00, 13.45s/it]
[INFO|modeling_utils.py:3643] 2023-12-03 16:39:54,020 >> All model checkpoint weights were used when initializing BaichuanForCausalLM.

[INFO|modeling_utils.py:3651] 2023-12-03 16:39:54,020 >> All the weights of BaichuanForCausalLM were initialized from the model checkpoint at baichuan-inc/Baichuan-13B-Base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BaichuanForCausalLM for predictions without further training.
[INFO|configuration_utils.py:730] 2023-12-03 16:39:54,295 >> loading configuration file generation_config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\generation_config.json
[INFO|configuration_utils.py:768] 2023-12-03 16:39:54,295 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.33.0"
}

12/03/2023 16:39:55 - INFO - llmtuner.model.utils - Upcasting weights in layernorm in float32.
12/03/2023 16:39:55 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
12/03/2023 16:39:55 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
[INFO|tokenization_utils_base.py:926] 2023-12-03 16:39:56,067 >> Assigning [] to the additional_special_tokens key of the tokenizer
12/03/2023 16:39:56 - INFO - llmtuner.model.loader - trainable params: 6553600 || all params: 13271454720 || trainable%: 0.0494
Loading cached processed dataset at C:\Users\Administrator\.cache\huggingface\datasets\json\default-2e9a277fc6a494f2\0.0.0\8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96\cache-d211757e67af8dbe.arrow
input_ids:
[703, 9805, 1493, 650, 15695, 3475, 680, 755, 1313, 9715, 945, 5840, 9091, 79, 776, 9091, 4278, 8922, 31125, 7320, 31125, 680, 1265, 952, 8943, 678, 656, 3475, 31155, 31114, 3155, 79, 31106, 5, 5132, 31143, 31106, 4550, 19463, 7841, 7868, 73, 5, 7905, 18056, 31143, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
[INFO|training_args.py:1332] 2023-12-03 16:39:57,215 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1764] 2023-12-03 16:39:57,216 >> PyTorch: setting up devices
inputs:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: 保持健康的三个提示。
Assistant: 以下是保持健康的三个提示:

  1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

  2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

  3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
    label_ids:
    [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
    labels:
    以下是保持健康的三个提示:

  1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

  2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

  3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
    [INFO|trainer.py:403] 2023-12-03 16:39:57,238 >> The model is quantized. To train this model you need to add additional modules inside the model such as adapters using peft library and freeze the model weights. Please check the examples in https://github.com/huggingface/peft for more details.
    [INFO|trainer.py:1712] 2023-12-03 16:39:58,150 >> ***** Running training *****
    [INFO|trainer.py:1713] 2023-12-03 16:39:58,150 >> Num examples = 75,775
    [INFO|trainer.py:1714] 2023-12-03 16:39:58,150 >> Num Epochs = 2
    [INFO|trainer.py:1715] 2023-12-03 16:39:58,150 >> Instantaneous batch size per device = 4
    [INFO|trainer.py:1718] 2023-12-03 16:39:58,150 >> Total train batch size (w. parallel, distributed & accumulation) = 32
    [INFO|trainer.py:1719] 2023-12-03 16:39:58,150 >> Gradient Accumulation steps = 8
    [INFO|trainer.py:1720] 2023-12-03 16:39:58,150 >> Total optimization steps = 4,736
    [INFO|trainer.py:1721] 2023-12-03 16:39:58,151 >> Number of trainable parameters = 6,553,600
    D:\softwares\Anaconda3\envs\zh\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
    warnings.warn(
    12/03/2023 16:41:50 - INFO - llmtuner.extras.callbacks - {'loss': 1.6244, 'learning_rate': 5.0000e-05, 'epoch': 0.00}
    {'loss': 1.6244, 'learning_rate': 4.999986249262817e-05, 'epoch': 0.0}
    OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
    OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
    C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2904: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
Training just hangs at this point and never makes progress.
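
For reference, the OMP hint above points at an unsafe, unsupported workaround: setting KMP_DUPLICATE_LIB_OK=TRUE before any library that loads the OpenMP runtime is imported. A minimal sketch of what that could look like at the top of the training entry point (this only keeps the process alive; per Intel's own warning it may crash or silently produce wrong results):

    import os

    # Unsafe workaround from the OMP hint above: tolerate duplicate OpenMP runtimes.
    os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

    # This must run before torch / bitsandbytes are imported, because they load libiomp5md.dll.
    import torch  # noqa: E402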

@hiyouga (Owner) commented Dec 3, 2023

On Windows you need to install a specific version of bitsandbytes.

hiyouga added the solved label on Dec 3, 2023
hiyouga closed this as completed on Dec 3, 2023
@dreamkillers666 (Author)

> On Windows you need to install a specific version of bitsandbytes.

I installed the specific version of bitsandbytes and the bitsandbytes BUG REPORT above is gone, but training still hangs at the same point.
bin D:\softwares\Anaconda3\envs\zh\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll
D:\softwares\Anaconda3\envs\zh\lib\site-packages\transformers\deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
D:\softwares\Anaconda3\envs\zh\lib\site-packages\trl\trainer\ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
D:\softwares\Anaconda3\envs\zh\lib\runpy.py:127: RuntimeWarning: 'llmtuner.webui.interface' found in sys.modules after import of package 'llmtuner.webui', but prior to execution of 'llmtuner.webui.interface'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch().
12/04/2023 13:35:46 - WARNING - llmtuner.model.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1332] 2023-12-04 13:35:46,721 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1764] 2023-12-04 13:35:46,721 >> PyTorch: setting up devices
D:\softwares\Anaconda3\envs\zh\lib\site-packages\transformers\training_args.py:1677: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(
12/04/2023 13:35:46 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: torch.float16
12/04/2023 13:35:46 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=True,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=saves\Baichuan-13B-Base\lora\baichuan-lora-sft\runs\Dec04_13-35-46_DESKTOP-VV27F79,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=adamw_torch,
optim_args=None,
output_dir=saves\Baichuan-13B-Base\lora\baichuan-lora-sft,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=saves\Baichuan-13B-Base\lora\baichuan-lora-sft,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
12/04/2023 13:35:46 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
12/04/2023 13:35:47 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\alpaca_gpt4_data_zh.json.
Using custom data configuration default-2e9a277fc6a494f2
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-2e9a277fc6a494f2/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
12/04/2023 13:35:48 - INFO - llmtuner.data.loader - Loading dataset self_cognition.json...
12/04/2023 13:35:48 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\self_cognition.json.
Using custom data configuration default-e1b11482e194d36c
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-e1b11482e194d36c/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
12/04/2023 13:35:49 - INFO - llmtuner.data.loader - Loading dataset sharegpt_zh_27k.json...
12/04/2023 13:35:49 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at E:\ZH\PycharmProjects\LLaMA-Factory\data\sharegpt_zh_27k.json.
Using custom data configuration default-fc66647fa583c5bc
Loading Dataset Infos from D:\softwares\Anaconda3\envs\zh\lib\site-packages\datasets\packaged_modules\json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\Administrator\.cache\huggingface\datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (C:/Users/Administrator/.cache/huggingface/datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from C:/Users/Administrator/.cache/huggingface/datasets/json/default-fc66647fa583c5bc/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Some of the datasets have disparate format. Resetting the format of the concatenated dataset.
[INFO|tokenization_utils_base.py:1852] 2023-12-04 13:35:51,620 >> loading file tokenizer.model from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\tokenizer.model
[INFO|tokenization_utils_base.py:1852] 2023-12-04 13:35:51,620 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1852] 2023-12-04 13:35:51,621 >> loading file special_tokens_map.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\special_tokens_map.json
[INFO|tokenization_utils_base.py:1852] 2023-12-04 13:35:51,622 >> loading file tokenizer_config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\tokenizer_config.json
[INFO|configuration_utils.py:715] 2023-12-04 13:35:51,899 >> loading configuration file config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\config.json
[INFO|configuration_utils.py:715] 2023-12-04 13:35:52,352 >> loading configuration file config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\config.json
[INFO|configuration_utils.py:775] 2023-12-04 13:35:52,353 >> Model config BaichuanConfig {
"_from_model_config": true,
"_name_or_path": "baichuan-inc/Baichuan-13B-Base",
"architectures": [
"BaichuanForCausalLM"
],
"auto_map": {
"AutoConfig": "baichuan-inc/Baichuan-13B-Base--configuration_baichuan.BaichuanConfig",
"AutoModelForCausalLM": "baichuan-inc/Baichuan-13B-Base--modeling_baichuan.BaichuanForCausalLM"
},
"bos_token_id": 1,
"eos_token_id": 2,
"gradient_checkpointing": [
false
],
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13696,
"model_max_length": 4096,
"model_type": "baichuan",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.33.0",
"use_cache": true,
"vocab_size": 64000
}

12/04/2023 13:35:52 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2857] 2023-12-04 13:35:53,273 >> loading weights file pytorch_model.bin from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\pytorch_model.bin.index.json
[INFO|modeling_utils.py:1200] 2023-12-04 13:35:53,281 >> Instantiating BaichuanForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:768] 2023-12-04 13:35:53,282 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.33.0"
}

[INFO|modeling_utils.py:2971] 2023-12-04 13:35:53,675 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:24<00:00, 8.06s/it]
[INFO|modeling_utils.py:3643] 2023-12-04 13:36:17,928 >> All model checkpoint weights were used when initializing BaichuanForCausalLM.

[INFO|modeling_utils.py:3651] 2023-12-04 13:36:17,928 >> All the weights of BaichuanForCausalLM were initialized from the model checkpoint at baichuan-inc/Baichuan-13B-Base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BaichuanForCausalLM for predictions without further training.
[INFO|configuration_utils.py:730] 2023-12-04 13:36:18,158 >> loading configuration file generation_config.json from cache at C:\Users\Administrator/.cache\huggingface\hub\models--baichuan-inc--Baichuan-13B-Base\snapshots\0ef0739c7bdd34df954003ef76d80f3dabca2ff9\generation_config.json
[INFO|configuration_utils.py:768] 2023-12-04 13:36:18,158 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.33.0"
}

12/04/2023 13:36:18 - INFO - llmtuner.model.utils - Upcasting weights in layernorm in float32.
12/04/2023 13:36:18 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
12/04/2023 13:36:18 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
12/04/2023 13:36:18 - INFO - llmtuner.model.loader - trainable params: 6553600 || all params: 13271454720 || trainable%: 0.0494
[INFO|tokenization_utils_base.py:926] 2023-12-04 13:36:18,330 >> Assigning [] to the additional_special_tokens key of the tokenizer
Loading cached processed dataset at C:\Users\Administrator\.cache\huggingface\datasets\json\default-2e9a277fc6a494f2\0.0.0\8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96\cache-d211757e67af8dbe.arrow
input_ids:
[703, 9805, 1493, 650, 15695, 3475, 680, 755, 1313, 9715, 945, 5840, 9091, 79, 776, 9091, 4278, 8922, 31125, 7320, 31125, 680, 1265, 952, 8943, 678, 656, 3475, 31155, 31114, 3155, 79, 31106, 5, 5132, 31143, 31106, 4550, 19463, 7841, 7868, 73, 5, 7905, 18056, 31143, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
inputs:
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: 保持健康的三个提示。
Assistant: 以下是保持健康的三个提示:

  1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

  2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯 。

  3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
    label_ids:
    [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
    labels:
    以下是保持健康的三个提示:

  1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

  2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯 。

  3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
    [INFO|training_args.py:1332] 2023-12-04 13:36:18,535 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
    [INFO|training_args.py:1764] 2023-12-04 13:36:18,535 >> PyTorch: setting up devices
    [INFO|trainer.py:403] 2023-12-04 13:36:18,537 >> The model is quantized. To train this model you need to add additional modules inside the model such as adapters using peft library and freeze the model weights. Please check the examples in https://github.com/huggingface/peft for more details.
    [INFO|trainer.py:1712] 2023-12-04 13:36:18,646 >> ***** Running training *****
    [INFO|trainer.py:1713] 2023-12-04 13:36:18,647 >> Num examples = 75,775
    [INFO|trainer.py:1714] 2023-12-04 13:36:18,647 >> Num Epochs = 2
    [INFO|trainer.py:1715] 2023-12-04 13:36:18,647 >> Instantaneous batch size per device = 4
    [INFO|trainer.py:1718] 2023-12-04 13:36:18,648 >> Total train batch size (w. parallel, distributed & accumulation) = 32
    [INFO|trainer.py:1719] 2023-12-04 13:36:18,648 >> Gradient Accumulation steps = 8
    [INFO|trainer.py:1720] 2023-12-04 13:36:18,648 >> Total optimization steps = 4,736
    [INFO|trainer.py:1721] 2023-12-04 13:36:18,649 >> Number of trainable parameters = 6,553,600
    D:\softwares\Anaconda3\envs\zh\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
    warnings.warn(
    12/04/2023 13:38:06 - INFO - llmtuner.extras.callbacks - {'loss': 1.6244, 'learning_rate': 5.0000e-05, 'epoch': 0.00}
    {'loss': 1.6244, 'learning_rate': 4.999986249262817e-05, 'epoch': 0.0}
    OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
    OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
    C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2904: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

@hiyouga (Owner) commented Dec 4, 2023

Monitor your system resources and check whether you are running out of memory.
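
A quick way to watch memory from a second terminal while training runs, assuming the psutil package is installed (it is not part of LLaMA-Factory; just a monitoring sketch):

    import time

    import psutil  # assumption: installed separately, e.g. via pip

    # Print system RAM usage every five seconds while training runs in another process.
    while True:
        mem = psutil.virtual_memory()
        print(f"RAM: {mem.percent:.1f}% used ({mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB)")
        time.sleep(5)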

@dreamkillers666 (Author)

It's not an out-of-memory problem; memory usage stays at around 11%.

@zhongzhubailong

I found a solution online: search the conda folder for libiomp5md.dll and delete the copy under Library\bin.
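
For anyone hitting the same error, a small sketch for listing every copy of libiomp5md.dll inside the active conda environment before deleting the duplicate under Library\bin (the environment path is the one from the logs above; back the file up before removing it):

    import sys
    from pathlib import Path

    # sys.prefix is the root of the active environment,
    # e.g. D:\softwares\Anaconda3\envs\zh in this thread.
    env_root = Path(sys.prefix)

    # Per the comment above, keeping only one copy of the Intel OpenMP runtime
    # (removing the one under Library\bin) resolves OMP: Error #15.
    for dll in sorted(env_root.rglob("libiomp5md.dll")):
        print(dll)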
