
cann: add Ascend NPU support #2336

Merged: 1 commit merged into ggerganov:master on Aug 9, 2024

Conversation

MengqingCao (Contributor)

This PR enables users to leverage the Ascend NPU for Whisper model inference in whisper.cpp.

Main changes

  1. Ascend NPU is already supported in llama.cpp via "Add Ascend NPU as a new backend" (llama.cpp#6034). This PR migrates the CANN-related code from llama.cpp into this project.
  2. The changes needed to utilize the Ascend NPU are made in src/whisper.cpp, ggml/CMakeLists.txt, etc. (a minimal sketch of the hook-up is shown below).
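
For illustration, the hook-up in src/whisper.cpp looks roughly like the sketch below. It reuses the ggml CANN backend API introduced in llama.cpp#6034 (ggml_backend_cann_init, the GGML_USE_CANN define); the surrounding function is paraphrased rather than copied from this PR:

// Paraphrased sketch of GPU backend selection with CANN enabled
// (illustrative only, not the literal code from this PR).
#ifdef GGML_USE_CANN
#include "ggml-cann.h"
#endif

static ggml_backend_t whisper_backend_init_gpu(const whisper_context_params & params) {
#ifdef GGML_USE_CANN
    WHISPER_LOG_INFO("%s: using CANN backend\n", __func__);
    ggml_backend_t backend = ggml_backend_cann_init(params.gpu_device);
    if (backend) {
        return backend;
    }
#endif
    return nullptr; // caller falls back to the CPU backend
}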

Build with CANN

Use the following commands to build whisper.cpp with CANN:

mkdir build
cd build
cmake .. -D GGML_CANN=on
make -j
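
This assumes the CANN toolkit is installed and its environment is loaded in the current shell before running cmake; the path below is the usual default install location and may differ on your system:

source /usr/local/Ascend/ascend-toolkit/set_env.sh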

ASR Inference

Inference test on the Whisper base model (ggml-base.en.bin, downloaded from https://huggingface.co/ggerganov/whisper.cpp/tree/main):

./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
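
If the model file is not already present, it can also be fetched with the download script bundled in the repository (equivalent to downloading it manually from the Hugging Face link above):

./models/download-ggml-model.sh base.en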

Inference result:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: using CANN backend
whisper_init_state: kv self size  =   18.87 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.75 MB
whisper_init_state: compute buffer (encode) =  131.94 MB
whisper_init_state: compute buffer (cross)  =    5.17 MB
whisper_init_state: compute buffer (decode) =  153.13 MB

system_info: n_threads = 8 / 192 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 1

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   223.83 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    19.95 ms
whisper_print_timings:   sample time =    94.43 ms /   131 runs (    0.72 ms per run)
whisper_print_timings:   encode time =   632.05 ms /     1 runs (  632.05 ms per run)
whisper_print_timings:   decode time =    56.30 ms /     2 runs (   28.15 ms per run)
whisper_print_timings:   batchd time =   930.68 ms /   125 runs (    7.45 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2854.32 ms

ASR inference for longer speech (https://upload.wikimedia.org/wikipedia/en/d/d4/En.henryfphillips.ogg):

[screenshot: whisper_cann_hp0_t8]

@MengqingCao (Contributor, Author)

cc @ggerganov @hipudding

abort(); \
} \
} while (0)
#define GGML_ABORT(...) ggml_abort(__FILE__, __LINE__, __VA_ARGS__)
@hipudding (Contributor) reviewed on Aug 7, 2024

Please describe the reason for modifying this part.

@MengqingCao (Contributor, Author)

This follows the refactor of GGML_ASSERT and the addition of GGML_ABORT in ggerganov/llama.cpp#8698. GGML_ABORT is used in the CANN-related code to abort the process with a message.
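
For illustration, a typical use in the CANN code might look like the following sketch (the helper name and status check are hypothetical, not taken from this PR):

#include "ggml.h" // provides GGML_ABORT / ggml_abort

// Hypothetical helper: abort with a formatted message when a CANN/ACL call
// does not return success (illustrative only).
static void cann_check(int status, const char * what) {
    if (status != 0) {
        GGML_ABORT("CANN error %d during %s", status, what);
    }
}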

@@ -760,7 +762,7 @@ struct test_dup : public test_case {
     }

     test_dup(ggml_type type = GGML_TYPE_F32,
-            std::array<int64_t, 4> ne = {10, 10, 10, 1},
+            std::array<int64_t, 4> ne = {10, 10, 20, 1},
@hipudding (Contributor)

Make sure no existing test scenarios are missed.

@MengqingCao (Contributor, Author)

This modification causes ne and nb to change after the permute, thus covering test_dup for non-contiguous data. It keeps the test in sync with https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L770
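
For context, the graph built by test_dup in tests/test-backend-ops.cpp looks roughly like this (paraphrased from the llama.cpp test; member names are approximate):

// Paraphrased sketch of test_dup's build_graph (not a verbatim copy).
// With ne = {10, 10, 20, 1}, the permuted view has strides (nb) that no longer
// describe a contiguous layout, so ggml_dup is exercised on non-contiguous data.
ggml_tensor * build_graph(ggml_context * ctx) {
    ggml_tensor * src = ggml_new_tensor(ctx, type, 4, ne.data());
    if (use_permute) {
        src = ggml_permute(ctx, src, permute[0], permute[1], permute[2], permute[3]);
    }
    return ggml_dup(ctx, src);
}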

@hipudding (Contributor)

@ggerganov Could you please review this PR? The main code comes from the Ascend NPU implementation in llama.cpp. We also want to support whisper.cpp with the Ascend backend. Thanks.

@ggerganov (Owner)

Thanks for the PR. I need to first sync the latest ggml repository into whisper.cpp. The sync will bring most of the changes from this PR and after that, we will apply the rest. Hopefully I will make the sync today or tomorrow.

@MengqingCao (Contributor, Author)

> Thanks for the PR. I need to first sync the latest ggml repository into whisper.cpp. The sync will bring most of the changes from this PR and after that, we will apply the rest. Hopefully I will make the sync today or tomorrow.

Thanks! That's great! Please @ me after the sync of ggml is done, and I'll rebase the commit then.

@ggerganov (Owner)

@MengqingCao The sync is now done - please update as necessary

(Pushed commit: cann : add Ascend NPU support)
  * enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp
@MengqingCao (Contributor, Author)

> @MengqingCao The sync is now done - please update as necessary

Hi @ggerganov, thanks for your work! This PR is updated now, please review it.

BTW, I synced tests/test-backend-ops.cpp with llama.cpp so that the test cases added when adding the Ascend NPU backend in llama.cpp are included here.

@ggerganov (Owner) left a comment

Thanks!

Consider adding instructions to the readme for using this backend in follow-up PRs.

ggerganov merged commit 81c999f into ggerganov:master on Aug 9, 2024. 46 checks passed.
@hipudding (Contributor)

> Thanks!
>
> Consider adding instructions to the readme for using this backend in follow-up PRs

Sure. We will.

bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Aug 12, 2024
* ggerganov/master: (118 commits)
  cann : add Ascend NPU support (ggerganov#2336)
  whisper : fix compile warning (#0)
  sync : ggml
  ggml : add CANN backend (llama/0)
  scripts : sync cann
  ci : disable ruby workflow (#0)
  ci : try to fix FreeBSD (#0)
  build : fix aarch64 (#0)
  talk-llama : sync llama.cpp
  sync : ggml
  ggml-backend : fix async copy from CPU (llama/8897)
  Updated SYCL device filtering (llama/8901)
  CUDA/HIP: fix tests/test-backend-ops (llama/8896)
  CUDA: fix padding logic for FP16/FP32 (llama/8884)
  ggml : add epsilon as a parameter for group_norm (llama/8818)
  ggml : fix overflows in elu function (llama/8866)
  ggml : reading the runtime sve config of the cpu (llama/8709)
  Fix conversion of unnormalized BF16->BF16 weights (llama/7843)
  Fixing wrong VDR iq4nl value (llama/8812)
  ggml-cuda: Adding support for unified memory (llama/8035)
  ...
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Aug 12, 2024
* master: (119 commits) same commit list as above, including cann : add Ascend NPU support (ggerganov#2336) ...
@lq0104 commented Aug 19, 2024

Great work. I have a question: does it support the Ascend 310P3 chip now? @MengqingCao @hipudding

@hipudding (Contributor) commented Aug 19, 2024

> Great work. I have a question: does it support the Ascend 310P3 chip now? @MengqingCao @hipudding

310 is not supported right now, but I think it's easy to support 310P by making some small changes. If you are interested in this project, please open a Pull Request.

@lq0104 commented Aug 22, 2024

I attempted to run on the 310P3 chip and encountered an issue with error messages. I've opened a new issue (#2372); could you help me identify where the problem lies?

iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp