[js/webgpu] Destroy staging buffers aggressively during weights uploading #22726
Conversation
In the current implementation, all the staging buffers for weights uploading are destroyed only after the first batch of kernel execution. This requires a lot of memory, since none of the staging buffers can be reused, and it also hurts startup time (weights uploading happens only during session creation), because the uploads are delayed until very late.

This PR submits the queue and destroys staging buffers very aggressively, so that the related GPU memory can be reused as much as possible, though how much is actually reclaimed depends on the WebGPU and driver implementation. The aggressive queue submission also issues the GPU operations much earlier, which helps the startup time.

I wrote some buffer-uploading benchmarks to compare multiple solutions with respect to memory and time consumption. The benchmarks can be found at https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html, and detailed test data at https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit. I also tested phi3.5 on two machines: first-inference time improved from 5141 ms to 3579 ms and from 4327 ms to 2947 ms, respectively.
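For context, here is a minimal sketch of the aggressive upload pattern described above, written against plain WebGPU in TypeScript (the function and parameter names are hypothetical, not ONNX Runtime's actual code). Each weight goes through its own staging buffer, the copy is submitted immediately, and the staging buffer is destroyed right away. Per the WebGPU spec, calling destroy() after submit() defers the actual deallocation until the GPU has consumed the buffer, which is what lets the driver recycle that memory for subsequent uploads:

```typescript
// Hypothetical sketch of per-weight upload with aggressive submit/destroy.
// Assumes gpuBuffer was created with GPUBufferUsage.COPY_DST and is at
// least `alignedSize` bytes; none of these names come from the ORT code.
function uploadWeight(device: GPUDevice, gpuBuffer: GPUBuffer, data: Uint8Array): void {
  // Buffer sizes must be a multiple of 4 when using mappedAtCreation.
  const alignedSize = Math.ceil(data.byteLength / 4) * 4;
  const staging = device.createBuffer({
    size: alignedSize,
    usage: GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Uint8Array(staging.getMappedRange()).set(data);
  staging.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, gpuBuffer, 0, alignedSize);
  device.queue.submit([encoder.finish()]); // submit per upload instead of batching
  staging.destroy(); // safe: the GPU keeps the buffer alive until the copy completes
}
```

The trade-off is more, smaller queue submissions; as the description notes, whether the freed staging memory is actually reused depends on the WebGPU implementation and the driver.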
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline
/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).