Significantly reduce memory consumption during ONNX inference & fix implementation mistakes from the previous PR #194

tsukumijima · 2024-12-24T18:58:01Z

As the title says.

Significantly reduce memory consumption during ONNX inference

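Converting the ONNX versions of the BERT models to FP16 turned out to save file size and memory with almost no effect on speech synthesis quality, so the ONNX BERT models downloaded when running initialize.py have been switched to the FP16 versions.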
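To make the storage argument concrete, here is a minimal sketch using only the Python standard library. It is not the conversion performed by convert_bert_onnx.py; the weight values are made up for illustration. It only shows that half-precision storage takes half the bytes of single precision while keeping the relative rounding error small.

```python
import struct

def pack_weights(weights, fmt):
    """Pack floats as little-endian binary; fmt is 'f' (FP32) or 'e' (FP16)."""
    return struct.pack(f"<{len(weights)}{fmt}", *weights)

# Hypothetical weight values, chosen only for illustration.
weights = [0.0123, -0.5, 1.75, 3.14159, -0.000321]

fp32_bytes = pack_weights(weights, "f")
fp16_bytes = pack_weights(weights, "e")

# Round-trip through FP16 to see how much precision is lost.
restored = struct.unpack(f"<{len(weights)}e", fp16_bytes)
max_rel_error = max(abs(a - b) / abs(a) for a, b in zip(weights, restored))

print(len(fp32_bytes), len(fp16_bytes))
print(f"max relative error: {max_rel_error:.2e}")
```

FP16 keeps a 10-bit mantissa, so values in its normal range are rounded with a relative error of at most about 2^-11, which fits the observation that synthesis quality is barely affected.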

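In addition, the memory leak seen during ONNX + CPU inference (memory pressure keeps growing the more inference is run) turned out to have two causes: ONNX Runtime's default configuration consumes memory lavishly, and BERT, being a Transformer architecture, consumes memory that grows with the input sequence length.
This has been resolved by adjusting ONNX Runtime's SessionOptions and RunOptions, which achieves a large reduction in memory consumption.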
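The kind of adjustment described above can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: the function name make_cpu_low_memory_options is hypothetical, and the exact set of options the PR changes is not listed in this description. The sketch uses only settings that exist in the ONNX Runtime Python API: SessionOptions.enable_cpu_mem_arena, SessionOptions.enable_mem_pattern, and the "memory.enable_memory_arena_shrinkage" run config entry.

```python
def make_cpu_low_memory_options(enable_cpu_mem_arena=False):
    """Build (SessionOptions, RunOptions) tuned for low memory on CPU.

    Hypothetical helper; onnxruntime is imported lazily so that merely
    defining this function does not require the package.
    """
    import onnxruntime as ort

    session_options = ort.SessionOptions()
    # The CPU arena keeps large allocations around for reuse, which is
    # fast but holds on to memory sized for the longest input seen so far.
    session_options.enable_cpu_mem_arena = enable_cpu_mem_arena
    # Memory pattern planning assumes stable input shapes; BERT inputs
    # vary in sequence length, so it is turned off in this sketch.
    session_options.enable_mem_pattern = False

    run_options = ort.RunOptions()
    if enable_cpu_mem_arena:
        # When the arena is kept, ask ONNX Runtime to shrink it after
        # each run so that peak usage is not retained indefinitely.
        run_options.add_run_config_entry(
            "memory.enable_memory_arena_shrinkage", "cpu:0"
        )
    return session_options, run_options

# Usage sketch (the model path and inputs are placeholders):
#   import onnxruntime as ort
#   so, ro = make_cpu_low_memory_options()
#   session = ort.InferenceSession(
#       "model.onnx", sess_options=so, providers=["CPUExecutionProvider"]
#   )
#   outputs = session.run(None, inputs, run_options=ro)
```

Disabling the arena trades some speed for a smaller footprint, which matches the trade-off this PR describes for CPU inference.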

Note

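As a trade-off, this change slows BERT model inference by roughly 0.2 to 0.3 seconds on a Core i5-13400, but since CPU inference was never very fast compared with CUDA inference, reducing memory consumption was given priority.
Applications can pass enable_cpu_mem_arena=True when calling onnx_bert_models.load_model() to have a memory arena built for BERT CPU inference and keep the previous performance (in exchange, memory usage remains high, although somewhat lower than before).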

Fix implementation mistakes from the previous PR

These changes fix implementation mistakes that the previous pull request left unfixed.

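Fixed an error raised by TTSModel.unload() when PyTorch is not installed (even when only ONNX inference is used)
Fixed the conversion of the tokenizer for the ONNX English BERT model (deberta-v3-large) to a Fast Tokenizer (in convert_bert_onnx.py), which was implemented incorrectly and caused English g2p processing to fail (for example, unknown words were read aloud as "unknown")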
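The first fix concerns code that must run whether or not PyTorch is present. One common way to write such cleanup is to check for the package before importing it. The sketch below is an assumption about the general pattern, not the actual implementation of TTSModel.unload(); the function name release_torch_cache_if_available is hypothetical.

```python
import gc
import importlib.util

def release_torch_cache_if_available():
    """Free cached memory; touch PyTorch only if it is installed.

    Returns True if PyTorch was found, False otherwise.
    """
    # Always collect unreachable Python objects first.
    gc.collect()

    # find_spec returns None when the package is not installed, so an
    # ONNX-only environment never reaches the import below.
    if importlib.util.find_spec("torch") is None:
        return False

    import torch

    # Release cached CUDA memory only when a CUDA device is usable.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return True

print(release_torch_cache_if_available())
```

Keeping the import inside the guarded branch means an ONNX-only installation can call the cleanup path without hitting an ImportError.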

Other changes

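The test code previously synthesized only a single string, which was inconvenient for performance measurement, so it now synthesizes multiple sentences.
convert_bert_onnx.py has also been improved so that it can convert both FP32 and FP16 versions of the ONNX BERT models.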

That is all. Thank you for reviewing.

aka7774 and others added 13 commits November 10, 2024 15:27

Update server_fastapi.py

3155e2a

API that returns g2p results as JSON

Merge pull request litagin02#177 from aka7774/master

065a7ff

Add /g2p (Update server_fastapi.py)

Merge branch 'litagin02:master' into master

e642a4c

Improve: Make it possible to convert BERT language models to FP16

c1fce3f

Fix: The conversion script for English BERT to Fast Tokenizer was incorrect, so tokenization was not performed correctly

2833fb5

Fix: spm.model is missing

ebd249b

Fix: Use Fast Tokenizer instead of Slow Tokenizer, which makes validation very slow

e1ce12a

Fix: TTSModel.unload() did not work in PyTorch-independent environments

b843804

Improve: Convert ONNX version of BERT models to FP16

15af441

I have found that half-precision has little effect on speech synthesis quality and, depending on the environment, can reduce file size and memory usage by half, so I have decided to use FP16.

Refactor: Use I/O Binding during BERT inference and always release memory after inference

08c439e

Improve: Adjusted the default Execution Provider options

405c5de

Improve: Variation of text-to-speech during testing

3c218c7

Improve: Disable enable_cpu_mem_arena for CPU inference only to prevent excessive memory consumption during the inference session of the BERT model

810ca43
