Add Ascend NPU support #5541

Merged · 9 commits into oobabooga:dev · Apr 11, 2024
Conversation

wangshuai09
Contributor

wangshuai09 commented Feb 19, 2024


Description

Ascend NPU is already supported by transformers, deepspeed, and other projects. Building on that work, I want to use an Ascend NPU for chat, and I also found other people who want to use the Ascend NPU (see the open issue #5261).

This PR makes the automatic installation succeed and has passed the tests below (a detection sketch follows the list):

  • 3 interface modes
  • deepspeed feature
  • lora training
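
For reference, backend auto-detection for Ascend typically comes down to whether the torch_npu adapter can be imported. The sketch below is illustrative only, not the PR's actual installer code; it assumes torch_npu, Huawei's out-of-tree PyTorch adapter, is installed:

    import torch

    def ascend_npu_available() -> bool:
        # torch_npu is Huawei's out-of-tree PyTorch adapter; importing it
        # registers the "npu" device type on the torch namespace.
        try:
            import torch_npu  # noqa: F401
        except ImportError:
            return False
        return torch.npu.is_available()

    print("Ascend NPU available:", ascend_npu_available())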

Verified on an Ascend NPU with ChatGLM2-6B:

> bash start_linux.sh --listen --trust-remote-code
08:58:04-725040 INFO     Starting Text generation web UI
08:58:04-732127 WARNING  trust_remote_code is enabled. This is dangerous.
08:58:04-733836 WARNING
                         You are potentially exposing the web UI to the entire internet without any access password.
                         You can create one with the "--gradio-auth" flag like this:

                         --gradio-auth username:password

                         Make sure to replace username:password with your own.
08:58:04-737644 INFO     Loading the extension "gallery"
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
08:58:20-113717 INFO     Loading "chatglm2-6b"
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00,  1.19it/s]
08:58:42-536748 INFO     LOADER: "Transformers"
08:58:42-539892 INFO     TRUNCATION LENGTH: 2048
08:58:42-541635 INFO     INSTRUCTION TEMPLATE: "ChatGLM"
08:58:42-543415 INFO     Loaded the model in 22.43 seconds.
09:07:09-065736 INFO     Deleted "logs/chat/Assistant/20240219-09-06-38.json".
Output generated in 2.55 seconds (9.02 tokens/s, 23 tokens, context 67, seed 1069506189)
Output generated in 2.70 seconds (9.27 tokens/s, 25 tokens, context 99, seed 1782759126)
Output generated in 4.45 seconds (10.78 tokens/s, 48 tokens, context 133, seed 1680409240)

  • Chat (screenshot)

  • Default (screenshot)

  • Notebook (screenshot)

  • deepspeed

    > bash start_linux.sh --listen --trust-remote-code --deepspeed
    [2024-02-19 09:10:21,694] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
    /home/wangshuai/downloads/src/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
      warnings.warn(
    [2024-02-19 09:10:34,113] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-19 09:10:34,114] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2024-02-19 09:10:34,196] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.0.235, master_port=29500
    [2024-02-19 09:10:34,196] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
    09:10:34-547232 INFO     Starting Text generation web UI
    09:10:34-554936 WARNING  trust_remote_code is enabled. This is dangerous.
    09:10:34-556662 WARNING
                             You are potentially exposing the web UI to the entire internet without any access password.
                             You can create one with the "--gradio-auth" flag like this:
    
                             --gradio-auth username:password
    
                             Make sure to replace username:password with your own.
    09:10:34-560776 INFO     Loading the extension "gallery"
    Running on local URL:  http://0.0.0.0:7860
    
    To create a public link, set `share=True` in `launch()`.
    09:10:45-100323 INFO     Loading "chatglm2-6b"
    [2024-02-19 09:10:51,866] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 199, num_elems = 6.24B
    Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:12<00:00,  1.76s/it]
    [2024-02-19 09:11:04,252] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.2, git-hash=unknown, git-branch=unknown
    [2024-02-19 09:11:04,270] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
    [2024-02-19 09:11:04,273] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
    [2024-02-19 09:11:04,730] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
    [2024-02-19 09:11:04,733] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.0 GB         Max_CA 1 GB
    [2024-02-19 09:11:04,733] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 224.34 GB, percent = 14.8%
    Parameter Offload: Total persistent parameters: 362496 in 85 params
    [2024-02-19 09:11:05,599] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
    [2024-02-19 09:11:05,601] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.0 GB         Max_CA 1 GB
    [2024-02-19 09:11:05,601] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 224.33 GB, percent = 14.8%
    [2024-02-19 09:11:05,603] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   activation_checkpointing_config  {
        "partition_activations": false,
        "contiguous_memory_optimization": false,
        "cpu_checkpointing": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    }
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   amp_enabled .................. False
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   amp_params ................... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   autotuning_config ............ {
        "enabled": false,
        "start_step": null,
        "end_step": null,
        "metric_path": null,
        "arg_mappings": null,
        "metric": "throughput",
        "model_info": null,
        "results_dir": "autotuning_results",
        "exps_dir": "autotuning_exps",
        "overwrite": true,
        "fast": true,
        "start_profile_step": 3,
        "end_profile_step": 5,
        "tuner_type": "gridsearch",
        "tuner_early_stopping": 5,
        "tuner_num_trials": 50,
        "model_info_path": null,
        "mp_size": 1,
        "max_train_batch_size": null,
        "min_train_batch_size": 1,
        "max_train_micro_batch_size_per_gpu": 1.024000e+03,
        "min_train_micro_batch_size_per_gpu": 1,
        "num_tuning_micro_batch_sizes": 3
    }
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   bfloat16_enabled ............. False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_parallel_write_pipeline  False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_tag_validation_enabled  True
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_tag_validation_fail  False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0xffff9aa46500>
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   communication_data_type ...... None
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   curriculum_enabled_legacy .... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   curriculum_params_legacy ..... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   data_efficiency_enabled ...... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dataloader_drop_last ......... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   disable_allgather ............ False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dump_state ................... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dynamic_loss_scale_args ...... None
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_enabled ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_gas_boundary_resolution  1
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_layer_name ........ bert.encoder.layer
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_layer_num ......... 0
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_max_iter .......... 100
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_stability ......... 1e-06
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_tol ............... 0.01
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_verbose ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   elasticity_enabled ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   flops_profiler_config ........ {
        "enabled": false,
        "recompute_fwd_factor": 0.0,
        "profile_step": 1,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    }
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_auto_cast ............... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_enabled ................. True
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_master_weights_and_gradients  False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   global_rank .................. 0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   grad_accum_dtype ............. None
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_accumulation_steps .. 1
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_clipping ............ 0.0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_predivide_factor .... 1.0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   graph_harvesting ............. False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   initial_dynamic_scale ........ 65536
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   load_universal_checkpoint .... False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   loss_scale ................... 0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   memory_breakdown ............. False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   mics_hierarchial_params_gather  False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   mics_shard_size .............. -1
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   nebula_config ................ {
        "enabled": false,
        "persistent_storage_path": null,
        "persistent_time_interval": 100,
        "num_of_version_in_retention": 2,
        "enable_nebula_load": true,
        "load_path": null
    }
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   optimizer_legacy_fusion ...... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   optimizer_name ............... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   optimizer_params ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pld_enabled .................. False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pld_params ................... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   prescale_gradients ........... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   scheduler_name ............... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   scheduler_params ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   seq_parallel_communication_data_type  torch.float32
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   sparse_attention ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   sparse_gradients_enabled ..... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   steps_per_print .............. 2000
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   train_batch_size ............. 1
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   train_micro_batch_size_per_gpu  1
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   use_data_before_expert_parallel_  False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   use_node_local_storage ....... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   wall_clock_breakdown ......... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   weight_quantization_config ... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   world_size ................... 1
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_allow_untested_optimizer  False
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_enabled ................. True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_force_ds_cpu_optimizer .. True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_optimization_stage ...... 3
    [2024-02-19 09:11:05,608] [INFO] [config.py:977:print_user_config]   json = {
        "fp16": {
            "enabled": true
        },
        "bf16": {
            "enabled": false
        },
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": "auto",
            "stage3_max_reuse_distance": "auto"
        },
        "steps_per_print": 2.000000e+03,
        "train_batch_size": 1,
        "train_micro_batch_size_per_gpu": 1,
        "wall_clock_breakdown": false
    }
    09:11:05-611003 INFO     DeepSpeed ZeRO-3 is enabled: True
    09:11:05-915391 INFO     LOADER: "Transformers"
    09:11:05-918768 INFO     TRUNCATION LENGTH: 2048
    09:11:05-920442 INFO     INSTRUCTION TEMPLATE: "ChatGLM"
    09:11:05-922044 INFO     Loaded the model in 20.82 seconds.
    Output generated in 17.08 seconds (0.94 tokens/s, 16 tokens, context 65, seed 707848364)
    Output generated in 19.78 seconds (1.26 tokens/s, 25 tokens, context 90, seed 1801174197)
    Output generated in 21.71 seconds (1.29 tokens/s, 28 tokens, context 122, seed 1289303107)

Verified on an Ascend NPU with opt-1.3b:

  • LoRA training (screenshot)

    07:35:23-151777 INFO     Loading "opt-1.3b"                                                                                                                       
    07:35:25-252973 INFO     LOADER: "Transformers"                                                                                                                   
    07:35:25-255518 INFO     TRUNCATION LENGTH: 2048                                                                                                                  
    07:35:25-257179 INFO     INSTRUCTION TEMPLATE: "Alpaca"                                                                                                           
    07:35:25-258768 INFO     Loaded the model in 2.10 seconds.                                                                                                        
    07:35:50-221577 INFO     Loading JSON datasets                                                                                                                    
    Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 741.50 examples/s]
    Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 757.32 examples/s]
    07:36:00-266715 INFO     Getting model ready                                                                                                                      
    07:36:00-269457 INFO     Preparing for training                                                                                                                   
    07:36:00-273396 INFO     Creating LoRA model                                                                                                                      
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    07:36:05-214395 INFO     Starting training                                                                                                                        
    Training 'opt' model using (q, v) projections
    Trainable params: 6,291,456 (0.4759 %), All params: 1,322,049,536 (Model: 1,315,758,080)
    07:36:05-275604 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                                                    
    start training
    wandb: Tracking run with wandb version 0.16.3
    wandb: W&B syncing is set to `offline` in this directory.  
    wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
    [W AmpForeachNonFiniteCheckAndUnscaleKernelNpuOpApi.cpp:104] Warning: Non finite check and unscale on NPU device! (function operator())
    Step: 127 {'eval_loss': 2.1790549755096436, 'eval_runtime': 71.7705, 'eval_samples_per_second': 27.867, 'eval_steps_per_second': 3.483, 'epoch': 0.26}
    Step: 159 {'loss': 2.2859, 'learning_rate': 0.0002926829268292683, 'epoch': 0.32}
    Step: 255 {'eval_loss': 2.0011792182922363, 'eval_runtime': 65.6536, 'eval_samples_per_second': 30.463, 'eval_steps_per_second': 3.808, 'epoch': 0.51}
    Step: 319 {'loss': 2.0853, 'learning_rate': 0.0002560975609756097, 'epoch': 0.64}
    Step: 383 {'eval_loss': 1.9035134315490723, 'eval_runtime': 59.9999, 'eval_samples_per_second': 33.333, 'eval_steps_per_second': 4.167, 'epoch': 0.77}
    Step: 479 {'loss': 1.9536, 'learning_rate': 0.0002195121951219512, 'epoch': 0.96}
    Step: 491 {'eval_loss': 1.841735601425171, 'eval_runtime': 59.7168, 'eval_samples_per_second': 33.491, 'eval_steps_per_second': 4.186, 'epoch': 1.02}
    Step: 619 {'loss': 1.8656, 'learning_rate': 0.00018292682926829266, 'epoch': 1.28}
    Step: 619 {'eval_loss': 1.8136059045791626, 'eval_runtime': 60.3887, 'eval_samples_per_second': 33.119, 'eval_steps_per_second': 4.14, 'epoch': 1.28}
    Step: 747 {'eval_loss': 1.7972328662872314, 'eval_runtime': 60.1174, 'eval_samples_per_second': 33.268, 'eval_steps_per_second': 4.159, 'epoch': 1.54}
    Step: 779 {'loss': 1.8519, 'learning_rate': 0.00014634146341463414, 'epoch': 1.6}
    Step: 875 {'eval_loss': 1.7847641706466675, 'eval_runtime': 61.1648, 'eval_samples_per_second': 32.699, 'eval_steps_per_second': 4.087, 'epoch': 1.79}
    Step: 939 {'loss': 1.8669, 'learning_rate': 0.0001097560975609756, 'epoch': 1.92}
    Step: 1015 {'eval_loss': 1.774510383605957, 'eval_runtime': 60.883, 'eval_samples_per_second': 32.85, 'eval_steps_per_second': 4.106, 'epoch': 2.05}
    Step: 1111 {'loss': 1.8282, 'learning_rate': 7.317073170731707e-05, 'epoch': 2.24}
    Step: 1143 {'eval_loss': 1.7672193050384521, 'eval_runtime': 60.3961, 'eval_samples_per_second': 33.115, 'eval_steps_per_second': 4.139, 'epoch': 2.3}
    Step: 1271 {'loss': 1.7926, 'learning_rate': 3.6585365853658535e-05, 'epoch': 2.56}
    Step: 1271 {'eval_loss': 1.7625163793563843, 'eval_runtime': 60.0916, 'eval_samples_per_second': 33.282, 'eval_steps_per_second': 4.16, 'epoch': 2.56}
    Step: 1399 {'eval_loss': 1.760359764099121, 'eval_runtime': 63.8109, 'eval_samples_per_second': 31.343, 'eval_steps_per_second': 3.918, 'epoch': 2.82}
    Step: 1431 {'loss': 1.7982, 'learning_rate': 0.0, 'epoch': 2.88}
    Step: 1431 {'train_runtime': 1744.3223, 'train_samples_per_second': 3.44, 'train_steps_per_second': 0.026, 'train_loss': 1.9253530502319336, 'epoch': 2.88}
    end training
    08:05:11-100991 INFO     LoRA training run is completed and saved. 
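
For reference, the "(q, v) projections" line in the log above corresponds to a PEFT LoRA configuration along these lines. The rank and alpha values below are assumptions, consistent with the reported 6,291,456 trainable parameters (24 layers × 2 modules × 32 × (2048 + 2048)), not confirmed settings from this run:

    from peft import LoraConfig, get_peft_model

    # Sketch: LoRA on OPT's query/value projections. r=32 / lora_alpha=64
    # are assumed values; `model` is the already-loaded opt-1.3b.
    config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # should report ~6.29M trainable params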

@Malrama

Malrama commented Feb 19, 2024

Why Huawei? What do NPUs have to do with Huawei? Is this some kind of advertisement?
I would strongly suggest renaming "Huawei Ascend" to simply "NPU". This is not specific to Huawei in any way.

@wangshuai09
Contributor Author

wangshuai09 commented Feb 20, 2024

@Malrama I have just followed the pattern of the existing NVIDIA/AMD/Apple GPU options. If another NPU vendor wants to be supported, a separate option such as "F) xxx" should be added to install that vendor's requirements. Thank you anyway; I will remove "Huawei" to avoid misunderstanding.
You can learn about Ascend from the Ascend Community.
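
To make that concrete, here is a hypothetical sketch of how a vendor entry fits into a one-click installer menu; the option letters and requirements filenames are illustrative, not the actual one_click.py contents:

    # Hypothetical installer menu; letters and filenames are illustrative.
    BACKENDS = {
        "A": ("NVIDIA", "requirements.txt"),
        "B": ("AMD", "requirements_amd.txt"),
        "C": ("Apple Silicon", "requirements_apple_silicon.txt"),
        "D": ("Ascend NPU", "requirements_npu.txt"),
    }

    menu = "\n".join(f"{key}) {name}" for key, (name, _) in BACKENDS.items())
    choice = input(f"What hardware do you have?\n{menu}\n> ").strip().upper()
    name, req_file = BACKENDS[choice]
    print(f"Installing requirements for {name} from {req_file} ...")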

@wangshuai09 marked this pull request as ready for review February 20, 2024 10:52
@OKN1212

OKN1212 commented Feb 20, 2024

I think it's fine to leave Huawei; the code lists chip vendors like AMD, Intel, and Apple, but that doesn't mean it's endorsing those companies. In fact, "Ascend" might be confusing down the line, because that's like listing Apple chips as "M" and Intel as "Arc".

@wangshuai09
Contributor Author

@OKN1212 thanks for your review. I think "Ascend" is clearer for users and would not cause the kind of confusion "Apple" might, because all Ascend series chips use the same versions of torch and torch_npu. If different Ascend chips require different torch/torch_npu versions in the future, "Ascend" alone may no longer be appropriate.

@wangshuai09
Contributor Author

Hi @oobabooga, thank you for this great work! I find that this project supports multiple backends, and there are people who want to use the Ascend NPU. The Ascend NPU has been supported officially by PyTorch since PyTorch 2.1, and people can use it for LLM training and inference.
This PR makes it possible to use an Ascend NPU. Could you please tell me what else needs to be done? And could you help review this code? :)
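
For context, once the torch_npu adapter is installed, using an Ascend NPU from PyTorch looks much like using CUDA. A minimal usage sketch (assuming torch >= 2.1 with a matching torch_npu; this is not code from the PR):

    import torch
    import torch_npu  # registers the "npu" device type with torch

    device = torch.device("npu:0" if torch.npu.is_available() else "cpu")

    x = torch.randn(2, 3).to(device)  # tensors move just as they do with CUDA
    y = (x @ x.T).cpu()               # compute on the NPU, copy the result back
    print(device, y.shape)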

@Yikun

Yikun commented Mar 11, 2024

Any plan to support this, @oobabooga? It looks like @Malrama's and @OKN1212's questions have been resolved.

When I tried text-generation-webui, I noticed it doesn't support Ascend NPU yet; then I found this PR, and it works!

BTW, I also found that stable-diffusion-webui 1.8 has been released [1]. Thanks for your work, @wangshuai09!

[1] AUTOMATIC1111/stable-diffusion-webui@96b5504

@wangshuai09
Contributor Author

Hi @oobabooga, this PR has been pending for over a month. May I know whether adding new backends is something you would accept?

@Touch-Night
Contributor

@oobabooga
Text-generation-webui's goal is to become the AUTOMATIC1111/stable-diffusion-webui of text generation. Now stable-diffusion-webui supports the Huawei Ascend NPU. Any reaction?

@oobabooga
Owner

I am not familiar with NPUs and do not have access to one, so I have reverted the various changes to the one-click installer and removed temporary workarounds, keeping the changes minimal.

@oobabooga merged commit fd4e46b into oobabooga:dev Apr 11, 2024
@Touch-Night
Contributor

Then which option should I select in the one-click installer if I have a Huawei Ascend NPU?

@wangshuai09
Contributor Author

Thanks for your reply! I almost thought this effort was going to fail.
Because the one-click installer changes and workarounds were removed, I may need to write a wiki page with patches for using the Ascend backend.
In the future, I will try to supply an Ascend NPU device for running CI and other validation work.

@wangshuai09
Contributor Author

wangshuai09 commented Apr 18, 2024

@Touch-Night I've written a wiki page describing how to patch upstream as a temporary alternative. Although it is not fancy, it lets you use an Ascend NPU:
https://github.com/wangshuai09/text-generation-webui/wiki/Install-and-run-on-Ascend-NPUs

@Touch-Night
Contributor

Touch-Night commented Apr 18, 2024

Thanks, but in my own fork (for localization) I have already reverted @oobabooga's changes, turning the Huawei Ascend NPU support into general NPU support.
By the way, I found a couple of typos in your wiki (quoted verbatim):

Preparation: Before installing stable-diffusion-webui for NPU, you should make sure that have installed the right CANN toolkit and kernels.

cd text-generatioin-webui

@Ginray

Ginray commented May 27, 2024

nice job

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024