
cann: add Ascend NPU support #2336

Merged: 1 commit merged into ggerganov:master on Aug 9, 2024

Conversation

MengqingCao (Contributor)

This PR enables users to leverage the Ascend NPU for Whisper model inference in whisper.cpp.

Main changes

  1. Ascend NPU is already supported in llama.cpp via "Add Ascend NPU as a new backend" (llama.cpp#6034). This PR migrates the CANN-related code from llama.cpp into this project.
  2. The changes needed to utilize the Ascend NPU are made in src/whisper.cpp, ggml/CMakeLists.txt, etc. (a minimal sketch of the hook-up is shown below).
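
For illustration, the hook-up in src/whisper.cpp looks roughly like the sketch below. It reuses the ggml CANN backend API introduced in llama.cpp#6034 (ggml_backend_cann_init, the GGML_USE_CANN define); the surrounding function is paraphrased rather than copied from this PR:

// Paraphrased sketch of GPU backend selection with CANN enabled
// (illustrative only, not the literal code from this PR).
#ifdef GGML_USE_CANN
#include "ggml-cann.h"
#endif

static ggml_backend_t whisper_backend_init_gpu(const whisper_context_params & params) {
#ifdef GGML_USE_CANN
    WHISPER_LOG_INFO("%s: using CANN backend\n", __func__);
    ggml_backend_t backend = ggml_backend_cann_init(params.gpu_device);
    if (backend) {
        return backend;
    }
#endif
    return nullptr; // caller falls back to the CPU backend
}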

Build with CANN

Use the following commands to build whisper.cpp with CANN:

mkdir build
cd build
cmake .. -D GGML_CANN=on
make -j
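
This assumes the CANN toolkit is installed and its environment is loaded in the current shell before running cmake; the path below is the usual default install location and may differ on your system:

source /usr/local/Ascend/ascend-toolkit/set_env.sh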

ASR Inference

Inference test on the Whisper base model (ggml-base.en.bin, downloaded from https://huggingface.co/ggerganov/whisper.cpp/tree/main):

./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
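
If the model file is not already present, it can also be fetched with the download script bundled in the repository (equivalent to downloading it manually from the Hugging Face link above):

./models/download-ggml-model.sh base.en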

Inference result:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: using CANN backend
whisper_init_state: kv self size  =   18.87 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.75 MB
whisper_init_state: compute buffer (encode) =  131.94 MB
whisper_init_state: compute buffer (cross)  =    5.17 MB
whisper_init_state: compute buffer (decode) =  153.13 MB

system_info: n_threads = 8 / 192 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 1

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   223.83 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    19.95 ms
whisper_print_timings:   sample time =    94.43 ms /   131 runs (    0.72 ms per run)
whisper_print_timings:   encode time =   632.05 ms /     1 runs (  632.05 ms per run)
whisper_print_timings:   decode time =    56.30 ms /     2 runs (   28.15 ms per run)
whisper_print_timings:   batchd time =   930.68 ms /   125 runs (    7.45 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2854.32 ms

ASR inference for longer speech (https://upload.wikimedia.org/wikipedia/en/d/d4/En.henryfphillips.ogg):

[screenshot: whisper_cann_hp0_t8]

@MengqingCao (Contributor, Author)

cc @ggerganov @hipudding

abort(); \
} \
} while (0)
#define GGML_ABORT(...) ggml_abort(__FILE__, __LINE__, __VA_ARGS__)
@hipudding (Contributor) reviewed on Aug 7, 2024

Please describe the reason for modifying this part.

@MengqingCao (Contributor, Author)

This follows the refactor of GGML_ASSERT and the addition of GGML_ABORT in ggerganov/llama.cpp#8698. GGML_ABORT is used in the CANN-related code to abort the process with a message.
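
For illustration, a typical use in the CANN code might look like the following sketch (the helper name and status check are hypothetical, not taken from this PR):

#include "ggml.h" // provides GGML_ABORT / ggml_abort

// Hypothetical helper: abort with a formatted message when a CANN/ACL call
// does not return success (illustrative only).
static void cann_check(int status, const char * what) {
    if (status != 0) {
        GGML_ABORT("CANN error %d during %s", status, what);
    }
}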

@@ -760,7 +762,7 @@ struct test_dup : public test_case {
     }

     test_dup(ggml_type type = GGML_TYPE_F32,
-            std::array<int64_t, 4> ne = {10, 10, 10, 1},
+            std::array<int64_t, 4> ne = {10, 10, 20, 1},
@hipudding (Contributor)

Make sure no existing test scenarios are missed.

@MengqingCao (Contributor, Author)

This modification causes ne and nb to change after the permute, thus covering test_dup for non-contiguous data. It keeps the test in sync with https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L770
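
For context, the graph built by test_dup in tests/test-backend-ops.cpp looks roughly like this (paraphrased from the llama.cpp test; member names are approximate):

// Paraphrased sketch of test_dup's build_graph (not a verbatim copy).
// With ne = {10, 10, 20, 1}, the permuted view has strides (nb) that no longer
// describe a contiguous layout, so ggml_dup is exercised on non-contiguous data.
ggml_tensor * build_graph(ggml_context * ctx) {
    ggml_tensor * src = ggml_new_tensor(ctx, type, 4, ne.data());
    if (use_permute) {
        src = ggml_permute(ctx, src, permute[0], permute[1], permute[2], permute[3]);
    }
    return ggml_dup(ctx, src);
}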

@hipudding (Contributor)

@ggerganov Could you please review this PR? The main code comes from the Ascend NPU implementation in llama.cpp. We also want to support whisper.cpp with the Ascend backend. Thanks.

@ggerganov (Owner)

Thanks for the PR. I need to first sync the latest ggml repository into whisper.cpp. The sync will bring most of the changes from this PR and after that, we will apply the rest. Hopefully I will make the sync today or tomorrow.

@MengqingCao (Contributor, Author)

> Thanks for the PR. I need to first sync the latest ggml repository into whisper.cpp. The sync will bring most of the changes from this PR and after that, we will apply the rest. Hopefully I will make the sync today or tomorrow.

Thanks! That's great! Please @ me after the sync of ggml is done, and I'll rebase the commit then.

@ggerganov (Owner)

@MengqingCao The sync is now done - please update as necessary

(Pushed commit: cann : add Ascend NPU support)
  * enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp
@MengqingCao (Contributor, Author)

> @MengqingCao The sync is now done - please update as necessary

Hi @ggerganov, thanks for your work! This PR is updated now, please review it.

BTW, I synced tests/test-backend-ops.cpp with llama.cpp so that the test cases added when adding the Ascend NPU backend in llama.cpp are included here.

@ggerganov (Owner) left a comment

Thanks!

Consider adding instructions to the readme for using this backend in follow-up PRs.

ggerganov merged commit 81c999f into ggerganov:master on Aug 9, 2024. 46 checks passed.
@hipudding (Contributor)

> Thanks!
>
> Consider adding instructions to the readme for using this backend in follow-up PRs

Sure. We will.

bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Aug 12, 2024
* ggerganov/master: (118 commits)
  cann : add Ascend NPU support (ggerganov#2336)
  whisper : fix compile warning (#0)
  sync : ggml
  ggml : add CANN backend (llama/0)
  scripts : sync cann
  ci : disable ruby workflow (#0)
  ci : try to fix FreeBSD (#0)
  build : fix aarch64 (#0)
  talk-llama : sync llama.cpp
  sync : ggml
  ggml-backend : fix async copy from CPU (llama/8897)
  Updated SYCL device filtering (llama/8901)
  CUDA/HIP: fix tests/test-backend-ops (llama/8896)
  CUDA: fix padding logic for FP16/FP32 (llama/8884)
  ggml : add epsilon as a parameter for group_norm (llama/8818)
  ggml : fix overflows in elu function (llama/8866)
  ggml : reading the runtime sve config of the cpu (llama/8709)
  Fix conversion of unnormalized BF16->BF16 weights (llama/7843)
  Fixing wrong VDR iq4nl value (llama/8812)
  ggml-cuda: Adding support for unified memory (llama/8035)
  ...
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Aug 12, 2024
* master: (119 commits) same commit list as above, including cann : add Ascend NPU support (ggerganov#2336) ...
@lq0104 commented Aug 19, 2024

Great work. I have a question: does it support the Ascend 310P3 chip now? @MengqingCao @hipudding

@hipudding (Contributor) commented Aug 19, 2024

> Great work. I have a question: does it support the Ascend 310P3 chip now? @MengqingCao @hipudding

310 is not supported right now, but I think it's easy to support 310P by making some small changes. If you are interested in this project, please open a Pull Request.

@lq0104 commented Aug 22, 2024

I attempted to run on the 310P3 chip and encountered an issue with error messages. I've opened a new issue (#2372); could you help me identify where the problem lies?

iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp