
[Performance] #13500

Open
igormis opened this issue Oct 29, 2022 · 10 comments

Labels
ep:CUDA issues related to the CUDA execution provider

Comments

igormis commented Oct 29, 2022

Describe the issue

GPU RAM gets exhausted during inference (after a certain number of calls); memory usage keeps increasing unpredictably.

To reproduce

Reproduction instructions

import onnxruntime as rt

# Disable memory pattern optimization and the CPU memory arena
sess_options = rt.SessionOptions()
sess_options.enable_mem_pattern = False
sess_options.enable_cpu_mem_arena = False

# Grow the CUDA arena only by the requested amount instead of doubling
cuda_provider_options = {"arena_extend_strategy": "kSameAsRequested"}
execution_providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]

sess = rt.InferenceSession("roberta_onnx_model/__MODEL_PROTO.onnx", sess_options, providers=execution_providers)
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
logits = sess.run([label_name], {input_name: X_test_sample})[0]
...
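
Not part of the original report, but one way to observe the growth is to sample GPU memory after each call; the pynvml-based loop below is an illustrative assumption, not something from this issue:

# Hypothetical helper to watch the leak: sample GPU memory after every call.
# pynvml (pip install nvidia-ml-py) is an assumption for illustration.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for i in range(1000):
    sess.run([label_name], {input_name: X_test_sample})
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
    print(f"call {i}: {used_mib:.0f} MiB used")  # a steady climb points at arena growth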

Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.11.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4

Model File

The model is a TensorFlow model (RoBERTa) exported to ONNX.

Is this a quantized model?

Unknown

github-actions bot added the ep:CUDA (issues related to the CUDA execution provider) label on Oct 29, 2022
igormis (Author) commented Oct 31, 2022

Thanks @yuslepukhin.
It is still increasing. For instance, I started with 3 models:
[screenshot: initial GPU memory usage]
and after a certain number of requests the GPU RAM gets exhausted:
[screenshot: GPU memory exhausted]

igormis (Author) commented Nov 1, 2022

I have also tried the following options:

cuda_provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": 3221225472,  # 3 GiB cap
    "cudnn_conv_use_max_workspace": "1",
}

but memory still grows.

pranavsharma (Contributor) commented Nov 4, 2022

The arena allocator used by the CUDA EP doesn't shrink by itself. You can configure it to shrink after every Run. See the Memory arena shrinkage section here https://onnxruntime.ai/docs/get-started/with-c.html. Also read #9509 (comment).
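
In Python, a minimal sketch of what this looks like (assuming a session sess and inputs as in the code above); the config key string is the value behind the C++ constant kOrtRunOptionsConfigEnableMemoryArenaShrinkage:

run_options = rt.RunOptions()
# Devices to shrink are semicolon-separated, e.g. "cpu:0;gpu:0".
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
logits = sess.run([label_name], {input_name: X_test_sample}, run_options)[0]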

igormis (Author) commented Nov 5, 2022

@pranavsharma thanks a lot, that's the perfect direction. I have these settings for the EPs in Python:

sess_options = rt.SessionOptions()
sess_options.enable_mem_pattern = False
sess_options.enable_cpu_mem_arena = False

cuda_provider_options = {"arena_extend_strategy": "kSameAsRequested", "cudnn_conv_use_max_workspace": "1"}
cpu_provider_options = {"arena_extend_strategy": "kSameAsRequested"}
# Pair each EP with its options and pass sess_options; otherwise
# cpu_provider_options and sess_options are silently unused.
execution_providers = [("CUDAExecutionProvider", cuda_provider_options), ("CPUExecutionProvider", cpu_provider_options)]
sess = rt.InferenceSession("roberta_onnx_model/__MODEL_PROTO.onnx", sess_options, providers=execution_providers)

So does this mean that for the RunOptions I need to do something like this?

run_options = rt.RunOptions()
# The config key is the *value* behind the C++ constant
# kOrtRunOptionsConfigEnableMemoryArenaShrinkage, not the constant's name.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
logits = sess.run([label_name], {input_name: X_test_sample}, run_options)[0]

@pranavsharma is this correct? I read the documentation, but I'm still a little confused.

pranavsharma (Contributor) commented
Yeah, that sounds right. Please look at the test code as well.

TEST(CApiTest, ConfigureCudaArenaAndDemonstrateMemoryArenaShrinkage) {

igormis (Author) commented Nov 28, 2022

The settings reduced the memory growth, but it still increases after a certain time. Is it possible to cap it at a fixed size somehow?

igormis (Author) commented Nov 28, 2022

After playing with the code and following the test code you sent me, these are the final settings:

sess_options = rt.SessionOptions()

cuda_provider_options = {"arena_extend_strategy": "kNextPowerOfTwo", "do_copy_in_default_stream": False}
# do_copy_in_default_stream is a CUDA EP option; the CPU EP only takes the arena strategy.
cpu_provider_options = {"arena_extend_strategy": "kNextPowerOfTwo"}

execution_providers = [("CUDAExecutionProvider", cuda_provider_options), ("CPUExecutionProvider", cpu_provider_options)]
sess = rt.InferenceSession("roberta_onnx_model/__MODEL_PROTO.onnx", sess_options, providers=execution_providers)

run_options = rt.RunOptions()
# Shrink both arenas after every Run; devices are semicolon-separated.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0")

logits = sess.run([label_name], {input_name: X_test_sample}, run_options)[0]

I don't know what else to set or change.
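
For a hard upper bound, the gpu_mem_limit option tried earlier in this thread can be combined with per-Run shrinkage; a sketch under those assumptions (3221225472 bytes = 3 GiB), not a confirmed fix:

cuda_provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": 3221225472,  # cap the CUDA arena at 3 GiB
}
execution_providers = [("CUDAExecutionProvider", cuda_provider_options), "CPUExecutionProvider"]
sess = rt.InferenceSession("roberta_onnx_model/__MODEL_PROTO.onnx", providers=execution_providers)

run_options = rt.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
logits = sess.run([label_name], {input_name: X_test_sample}, run_options)[0]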

bluddy commented Nov 6, 2023

I'd like to know whether this really limits the growth of memory usage, especially on the GPU.
We're making serious use of onnxruntime and need to know if we can rely on it in a Python-based system.

niyathimariya commented
Hi, is there any update on this issue?
