Add Ascend NPU support #5541

Merged · 9 commits into oobabooga:dev · Apr 11, 2024
Conversation

wangshuai09
Contributor

wangshuai09 commented Feb 19, 2024


Description

Ascend NPU is already supported by transformers, deepspeed, and other projects. Building on that work, I want to use an Ascend NPU for chat, and I also found other people who want to use the Ascend NPU (see the open issue #5261).

This PR makes the automatic installation succeed and has passed the tests below (a detection sketch follows the list):

  • 3 interface modes
  • deepspeed feature
  • lora training
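
For reference, backend auto-detection for Ascend typically comes down to whether the torch_npu adapter can be imported. The sketch below is illustrative only, not the PR's actual installer code; it assumes torch_npu, Huawei's out-of-tree PyTorch adapter, is installed:

    import torch

    def ascend_npu_available() -> bool:
        # torch_npu is Huawei's out-of-tree PyTorch adapter; importing it
        # registers the "npu" device type on the torch namespace.
        try:
            import torch_npu  # noqa: F401
        except ImportError:
            return False
        return torch.npu.is_available()

    print("Ascend NPU available:", ascend_npu_available())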

Verified on an Ascend NPU with ChatGLM2-6B:

> bash start_linux.sh --listen --trust-remote-code
08:58:04-725040 INFO     Starting Text generation web UI
08:58:04-732127 WARNING  trust_remote_code is enabled. This is dangerous.
08:58:04-733836 WARNING
                         You are potentially exposing the web UI to the entire internet without any access password.
                         You can create one with the "--gradio-auth" flag like this:

                         --gradio-auth username:password

                         Make sure to replace username:password with your own.
08:58:04-737644 INFO     Loading the extension "gallery"
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
08:58:20-113717 INFO     Loading "chatglm2-6b"
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00,  1.19it/s]
08:58:42-536748 INFO     LOADER: "Transformers"
08:58:42-539892 INFO     TRUNCATION LENGTH: 2048
08:58:42-541635 INFO     INSTRUCTION TEMPLATE: "ChatGLM"
08:58:42-543415 INFO     Loaded the model in 22.43 seconds.
09:07:09-065736 INFO     Deleted "logs/chat/Assistant/20240219-09-06-38.json".
Output generated in 2.55 seconds (9.02 tokens/s, 23 tokens, context 67, seed 1069506189)
Output generated in 2.70 seconds (9.27 tokens/s, 25 tokens, context 99, seed 1782759126)
Output generated in 4.45 seconds (10.78 tokens/s, 48 tokens, context 133, seed 1680409240)

  • Chat (screenshot)

  • Default (screenshot)

  • Notebook (screenshot)

  • deepspeed

    > bash start_linux.sh --listen --trust-remote-code --deepspeed
    [2024-02-19 09:10:21,694] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
    /home/wangshuai/downloads/src/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
      warnings.warn(
    [2024-02-19 09:10:34,113] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-02-19 09:10:34,114] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
    [2024-02-19 09:10:34,196] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.0.235, master_port=29500
    [2024-02-19 09:10:34,196] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
    09:10:34-547232 INFO     Starting Text generation web UI
    09:10:34-554936 WARNING  trust_remote_code is enabled. This is dangerous.
    09:10:34-556662 WARNING
                             You are potentially exposing the web UI to the entire internet without any access password.
                             You can create one with the "--gradio-auth" flag like this:
    
                             --gradio-auth username:password
    
                             Make sure to replace username:password with your own.
    09:10:34-560776 INFO     Loading the extension "gallery"
    Running on local URL:  http://0.0.0.0:7860
    
    To create a public link, set `share=True` in `launch()`.
    09:10:45-100323 INFO     Loading "chatglm2-6b"
    [2024-02-19 09:10:51,866] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 199, num_elems = 6.24B
    Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:12<00:00,  1.76s/it]
    [2024-02-19 09:11:04,252] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.2, git-hash=unknown, git-branch=unknown
    [2024-02-19 09:11:04,270] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
    [2024-02-19 09:11:04,273] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
    [2024-02-19 09:11:04,730] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
    [2024-02-19 09:11:04,733] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.0 GB         Max_CA 1 GB
    [2024-02-19 09:11:04,733] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 224.34 GB, percent = 14.8%
    Parameter Offload: Total persistent parameters: 362496 in 85 params
    [2024-02-19 09:11:05,599] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
    [2024-02-19 09:11:05,601] [INFO] [utils.py:801:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.0 GB         Max_CA 1 GB
    [2024-02-19 09:11:05,601] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 224.33 GB, percent = 14.8%
    [2024-02-19 09:11:05,603] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   activation_checkpointing_config  {
        "partition_activations": false,
        "contiguous_memory_optimization": false,
        "cpu_checkpointing": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    }
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   amp_enabled .................. False
    [2024-02-19 09:11:05,603] [INFO] [config.py:991:print]   amp_params ................... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   autotuning_config ............ {
        "enabled": false,
        "start_step": null,
        "end_step": null,
        "metric_path": null,
        "arg_mappings": null,
        "metric": "throughput",
        "model_info": null,
        "results_dir": "autotuning_results",
        "exps_dir": "autotuning_exps",
        "overwrite": true,
        "fast": true,
        "start_profile_step": 3,
        "end_profile_step": 5,
        "tuner_type": "gridsearch",
        "tuner_early_stopping": 5,
        "tuner_num_trials": 50,
        "model_info_path": null,
        "mp_size": 1,
        "max_train_batch_size": null,
        "min_train_batch_size": 1,
        "max_train_micro_batch_size_per_gpu": 1.024000e+03,
        "min_train_micro_batch_size_per_gpu": 1,
        "num_tuning_micro_batch_sizes": 3
    }
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   bfloat16_enabled ............. False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_parallel_write_pipeline  False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_tag_validation_enabled  True
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   checkpoint_tag_validation_fail  False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0xffff9aa46500>
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   communication_data_type ...... None
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   curriculum_enabled_legacy .... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   curriculum_params_legacy ..... False
    [2024-02-19 09:11:05,604] [INFO] [config.py:991:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   data_efficiency_enabled ...... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dataloader_drop_last ......... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   disable_allgather ............ False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dump_state ................... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   dynamic_loss_scale_args ...... None
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_enabled ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_gas_boundary_resolution  1
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_layer_name ........ bert.encoder.layer
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_layer_num ......... 0
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_max_iter .......... 100
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_stability ......... 1e-06
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_tol ............... 0.01
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   eigenvalue_verbose ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   elasticity_enabled ........... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   flops_profiler_config ........ {
        "enabled": false,
        "recompute_fwd_factor": 0.0,
        "profile_step": 1,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    }
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_auto_cast ............... False
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_enabled ................. True
    [2024-02-19 09:11:05,605] [INFO] [config.py:991:print]   fp16_master_weights_and_gradients  False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   global_rank .................. 0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   grad_accum_dtype ............. None
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_accumulation_steps .. 1
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_clipping ............ 0.0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   gradient_predivide_factor .... 1.0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   graph_harvesting ............. False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   initial_dynamic_scale ........ 65536
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   load_universal_checkpoint .... False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   loss_scale ................... 0
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   memory_breakdown ............. False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   mics_hierarchial_params_gather  False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   mics_shard_size .............. -1
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   nebula_config ................ {
        "enabled": false,
        "persistent_storage_path": null,
        "persistent_time_interval": 100,
        "num_of_version_in_retention": 2,
        "enable_nebula_load": true,
        "load_path": null
    }
    [2024-02-19 09:11:05,606] [INFO] [config.py:991:print]   optimizer_legacy_fusion ...... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   optimizer_name ............... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   optimizer_params ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pld_enabled .................. False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   pld_params ................... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   prescale_gradients ........... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   scheduler_name ............... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   scheduler_params ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   seq_parallel_communication_data_type  torch.float32
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   sparse_attention ............. None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   sparse_gradients_enabled ..... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   steps_per_print .............. 2000
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   train_batch_size ............. 1
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   train_micro_batch_size_per_gpu  1
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   use_data_before_expert_parallel_  False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   use_node_local_storage ....... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   wall_clock_breakdown ......... False
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   weight_quantization_config ... None
    [2024-02-19 09:11:05,607] [INFO] [config.py:991:print]   world_size ................... 1
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_allow_untested_optimizer  False
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_enabled ................. True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_force_ds_cpu_optimizer .. True
    [2024-02-19 09:11:05,608] [INFO] [config.py:991:print]   zero_optimization_stage ...... 3
    [2024-02-19 09:11:05,608] [INFO] [config.py:977:print_user_config]   json = {
        "fp16": {
            "enabled": true
        },
        "bf16": {
            "enabled": false
        },
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": "auto",
            "stage3_max_reuse_distance": "auto"
        },
        "steps_per_print": 2.000000e+03,
        "train_batch_size": 1,
        "train_micro_batch_size_per_gpu": 1,
        "wall_clock_breakdown": false
    }
    09:11:05-611003 INFO     DeepSpeed ZeRO-3 is enabled: True
    09:11:05-915391 INFO     LOADER: "Transformers"
    09:11:05-918768 INFO     TRUNCATION LENGTH: 2048
    09:11:05-920442 INFO     INSTRUCTION TEMPLATE: "ChatGLM"
    09:11:05-922044 INFO     Loaded the model in 20.82 seconds.
    Output generated in 17.08 seconds (0.94 tokens/s, 16 tokens, context 65, seed 707848364)
    Output generated in 19.78 seconds (1.26 tokens/s, 25 tokens, context 90, seed 1801174197)
    Output generated in 21.71 seconds (1.29 tokens/s, 28 tokens, context 122, seed 1289303107)

Verified on an Ascend NPU with opt-1.3b:

  • LoRA training (screenshot)

    07:35:23-151777 INFO     Loading "opt-1.3b"                                                                                                                       
    07:35:25-252973 INFO     LOADER: "Transformers"                                                                                                                   
    07:35:25-255518 INFO     TRUNCATION LENGTH: 2048                                                                                                                  
    07:35:25-257179 INFO     INSTRUCTION TEMPLATE: "Alpaca"                                                                                                           
    07:35:25-258768 INFO     Loaded the model in 2.10 seconds.                                                                                                        
    07:35:50-221577 INFO     Loading JSON datasets                                                                                                                    
    Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 741.50 examples/s]
    Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:02<00:00, 757.32 examples/s]
    07:36:00-266715 INFO     Getting model ready                                                                                                                      
    07:36:00-269457 INFO     Preparing for training                                                                                                                   
    07:36:00-273396 INFO     Creating LoRA model                                                                                                                      
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
    07:36:05-214395 INFO     Starting training                                                                                                                        
    Training 'opt' model using (q, v) projections
    Trainable params: 6,291,456 (0.4759 %), All params: 1,322,049,536 (Model: 1,315,758,080)
    07:36:05-275604 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                                                    
    start training
    wandb: Tracking run with wandb version 0.16.3
    wandb: W&B syncing is set to `offline` in this directory.  
    wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
    [W AmpForeachNonFiniteCheckAndUnscaleKernelNpuOpApi.cpp:104] Warning: Non finite check and unscale on NPU device! (function operator())
    Step: 127 {'eval_loss': 2.1790549755096436, 'eval_runtime': 71.7705, 'eval_samples_per_second': 27.867, 'eval_steps_per_second': 3.483, 'epoch': 0.26}
    Step: 159 {'loss': 2.2859, 'learning_rate': 0.0002926829268292683, 'epoch': 0.32}
    Step: 255 {'eval_loss': 2.0011792182922363, 'eval_runtime': 65.6536, 'eval_samples_per_second': 30.463, 'eval_steps_per_second': 3.808, 'epoch': 0.51}
    Step: 319 {'loss': 2.0853, 'learning_rate': 0.0002560975609756097, 'epoch': 0.64}
    Step: 383 {'eval_loss': 1.9035134315490723, 'eval_runtime': 59.9999, 'eval_samples_per_second': 33.333, 'eval_steps_per_second': 4.167, 'epoch': 0.77}
    Step: 479 {'loss': 1.9536, 'learning_rate': 0.0002195121951219512, 'epoch': 0.96}
    Step: 491 {'eval_loss': 1.841735601425171, 'eval_runtime': 59.7168, 'eval_samples_per_second': 33.491, 'eval_steps_per_second': 4.186, 'epoch': 1.02}
    Step: 619 {'loss': 1.8656, 'learning_rate': 0.00018292682926829266, 'epoch': 1.28}
    Step: 619 {'eval_loss': 1.8136059045791626, 'eval_runtime': 60.3887, 'eval_samples_per_second': 33.119, 'eval_steps_per_second': 4.14, 'epoch': 1.28}
    Step: 747 {'eval_loss': 1.7972328662872314, 'eval_runtime': 60.1174, 'eval_samples_per_second': 33.268, 'eval_steps_per_second': 4.159, 'epoch': 1.54}
    Step: 779 {'loss': 1.8519, 'learning_rate': 0.00014634146341463414, 'epoch': 1.6}
    Step: 875 {'eval_loss': 1.7847641706466675, 'eval_runtime': 61.1648, 'eval_samples_per_second': 32.699, 'eval_steps_per_second': 4.087, 'epoch': 1.79}
    Step: 939 {'loss': 1.8669, 'learning_rate': 0.0001097560975609756, 'epoch': 1.92}
    Step: 1015 {'eval_loss': 1.774510383605957, 'eval_runtime': 60.883, 'eval_samples_per_second': 32.85, 'eval_steps_per_second': 4.106, 'epoch': 2.05}
    Step: 1111 {'loss': 1.8282, 'learning_rate': 7.317073170731707e-05, 'epoch': 2.24}
    Step: 1143 {'eval_loss': 1.7672193050384521, 'eval_runtime': 60.3961, 'eval_samples_per_second': 33.115, 'eval_steps_per_second': 4.139, 'epoch': 2.3}
    Step: 1271 {'loss': 1.7926, 'learning_rate': 3.6585365853658535e-05, 'epoch': 2.56}
    Step: 1271 {'eval_loss': 1.7625163793563843, 'eval_runtime': 60.0916, 'eval_samples_per_second': 33.282, 'eval_steps_per_second': 4.16, 'epoch': 2.56}
    Step: 1399 {'eval_loss': 1.760359764099121, 'eval_runtime': 63.8109, 'eval_samples_per_second': 31.343, 'eval_steps_per_second': 3.918, 'epoch': 2.82}
    Step: 1431 {'loss': 1.7982, 'learning_rate': 0.0, 'epoch': 2.88}
    Step: 1431 {'train_runtime': 1744.3223, 'train_samples_per_second': 3.44, 'train_steps_per_second': 0.026, 'train_loss': 1.9253530502319336, 'epoch': 2.88}
    end training
    08:05:11-100991 INFO     LoRA training run is completed and saved. 
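
For reference, the "(q, v) projections" line in the log above corresponds to a PEFT LoRA configuration along these lines. The rank and alpha values below are assumptions, consistent with the reported 6,291,456 trainable parameters (24 layers × 2 modules × 32 × (2048 + 2048)), not confirmed settings from this run:

    from peft import LoraConfig, get_peft_model

    # Sketch: LoRA on OPT's query/value projections. r=32 / lora_alpha=64
    # are assumed values; `model` is the already-loaded opt-1.3b.
    config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # should report ~6.29M trainable params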

@Malrama

Malrama commented Feb 19, 2024

Why Huawei? What do NPUs have to do with Huawei? Is this some kind of advertisement?
I would strongly suggest renaming "Huawei Ascend" to simply "NPU". This is not specific to Huawei in any way.

@wangshuai09
Contributor Author

wangshuai09 commented Feb 20, 2024

@Malrama I have just followed the pattern of the existing NVIDIA/AMD/Apple GPU options. If another NPU vendor wants to be supported, a separate option such as "F) xxx" should be added to install that vendor's requirements. Thank you anyway; I will remove "Huawei" to avoid misunderstanding.
You can learn about Ascend from the Ascend Community.
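
To make that concrete, here is a hypothetical sketch of how a vendor entry fits into a one-click installer menu; the option letters and requirements filenames are illustrative, not the actual one_click.py contents:

    # Hypothetical installer menu; letters and filenames are illustrative.
    BACKENDS = {
        "A": ("NVIDIA", "requirements.txt"),
        "B": ("AMD", "requirements_amd.txt"),
        "C": ("Apple Silicon", "requirements_apple_silicon.txt"),
        "D": ("Ascend NPU", "requirements_npu.txt"),
    }

    menu = "\n".join(f"{key}) {name}" for key, (name, _) in BACKENDS.items())
    choice = input(f"What hardware do you have?\n{menu}\n> ").strip().upper()
    name, req_file = BACKENDS[choice]
    print(f"Installing requirements for {name} from {req_file} ...")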

@wangshuai09 marked this pull request as ready for review February 20, 2024 10:52
@OKN1212

OKN1212 commented Feb 20, 2024

I think it's fine to leave Huawei; the code lists chip vendors like AMD, Intel, and Apple, but that doesn't mean it's endorsing those companies. In fact, "Ascend" might be confusing down the line, because that's like listing Apple chips as "M" and Intel as "Arc".

@wangshuai09
Contributor Author

@OKN1212 thanks for your review. I think "Ascend" is clearer for users and would not cause the kind of confusion "Apple" might, because all Ascend series chips use the same versions of torch and torch_npu. If different Ascend chips require different torch/torch_npu versions in the future, "Ascend" alone may no longer be appropriate.

@wangshuai09
Contributor Author

Hi @oobabooga, thank you for this great work! I find that this project supports multiple backends, and there are people who want to use the Ascend NPU. The Ascend NPU has been supported officially by PyTorch since PyTorch 2.1, and people can use it for LLM training and inference.
This PR makes it possible to use an Ascend NPU. Could you please tell me what else needs to be done? And could you help review this code? :)
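
For context, once the torch_npu adapter is installed, using an Ascend NPU from PyTorch looks much like using CUDA. A minimal usage sketch (assuming torch >= 2.1 with a matching torch_npu; this is not code from the PR):

    import torch
    import torch_npu  # registers the "npu" device type with torch

    device = torch.device("npu:0" if torch.npu.is_available() else "cpu")

    x = torch.randn(2, 3).to(device)  # tensors move just as they do with CUDA
    y = (x @ x.T).cpu()               # compute on the NPU, copy the result back
    print(device, y.shape)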

@Yikun

Yikun commented Mar 11, 2024

Any plan to support this, @oobabooga? It looks like @Malrama's and @OKN1212's questions have been resolved.

When I tried text-generation-webui, I noticed it doesn't support Ascend NPU yet; then I found this PR, and it works!

BTW, I also found that stable-diffusion-webui 1.8 has been released [1]. Thanks for your work, @wangshuai09!

[1] AUTOMATIC1111/stable-diffusion-webui@96b5504

@wangshuai09
Contributor Author

Hi @oobabooga, this PR has been pending for over a month. May I know whether adding new backends is something you would accept?

@Touch-Night
Contributor

@oobabooga
Text-generation-webui's goal is to become the AUTOMATIC1111/stable-diffusion-webui of text generation. Now stable-diffusion-webui supports the Huawei Ascend NPU. Any reaction?

@oobabooga
Owner

I am not familiar with NPUs and do not have access to one, so I have reverted the various changes to the one-click installer and removed temporary workarounds, keeping the changes minimal.

@oobabooga merged commit fd4e46b into oobabooga:dev Apr 11, 2024
@Touch-Night
Contributor

Then which option should I select in the one-click installer if I have a Huawei Ascend NPU?

@wangshuai09
Contributor Author

Thanks for your reply! I almost thought this effort was going to fail.
Because the one-click installer changes and workarounds were removed, I may need to write a wiki page with patches for using the Ascend backend.
In the future, I will try to supply an Ascend NPU device for running CI and other validation work.

@wangshuai09
Contributor Author

wangshuai09 commented Apr 18, 2024

@Touch-Night I've written a wiki page describing how to patch upstream as a temporary alternative. Although it is not fancy, it lets you use an Ascend NPU:
https://github.com/wangshuai09/text-generation-webui/wiki/Install-and-run-on-Ascend-NPUs

@Touch-Night
Contributor

Touch-Night commented Apr 18, 2024

Thanks, but in my own fork (for localization) I have already reverted @oobabooga's changes, turning the Huawei Ascend NPU support into general NPU support.
By the way, I found a couple of typos in your wiki (quoted verbatim):

Preparation: Before installing stable-diffusion-webui for NPU, you should make sure that have installed the right CANN toolkit and kernels.

cd text-generatioin-webui

@Ginray

Ginray commented May 27, 2024

nice job

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024