How to release GPU memory while keeping the onnxruntime session around #9509

Open
Z-XQ opened this issue Oct 23, 2021 · 12 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), feature request (request for unsupported feature or enhancement)

Comments

@Z-XQ

Z-XQ commented Oct 23, 2021

I want to release GPU memory promptly while keeping the session alive. Thank you!

@hariharans29
Member

Can you please elaborate on your scenario? What exactly do you mean by "release GPU memory in time and keep the session running"? Do you mean you want to shrink any GPU memory arena associated with the session periodically while still keeping the session alive?

hariharans29 added the ep:CUDA label Oct 25, 2021
Z-XQ closed this as completed Oct 25, 2021
Z-XQ reopened this Oct 25, 2021
@Z-XQ
Author

Z-XQ commented Oct 25, 2021

> Can you please elaborate on your scenario? What exactly do you mean by "release GPU memory in time and keep the session running"? Do you mean you want to shrink any GPU memory arena associated with the session periodically while still keeping the session alive?

Thank you for your reply! Here is my scenario.

My GPU is a 3090. 708 MB of GPU memory is in use before I open an onnxruntime session.
Then I open a session with:
ort_session = onnxruntime.InferenceSession(model_path)
GPU memory usage rises to about 1.7 GB.

When I run inference on one image as follows, GPU memory usage rises to about 2.0 GB, and it does not go back down after the inference is finished.
seg_raw_output = tmp.run([], {self.seg_input_name: seg_input_data})[0]

Therefore, I want to release some of the GPU memory so that usage returns to 1.7 GB while still keeping the session alive.

Thank you! I am looking forward to your reply!

@hariharans29
Member

Thanks @Z-XQ for the explanation.

The GPU memory is backed by a memory pool (arena), and we have a config knob to shrink the arena (de-allocate unused memory chunks).

I'm not sure we have enough tools to accomplish this in Python just yet. The best way to use this feature in C++ is to:

  1. Not allocate weights memory through the arena: see here.

  2. Configure the arena with a high enough initial chunk to service most Run() calls: see "initial_chunk_size_bytes" here.

  3. Finally, configure the arena to shrink on every Run(): see here. This keeps the initial chunk allocated but de-allocates any unused chunks remaining after the Run() call ends.

For example, if the initial chunk size is set to 500MB, the first Run() will allocate 500MB plus any additional chunks required to service that Run() call. The additional chunks are de-allocated after Run(), leaving only the 500MB initial chunk allocated. It is important not to allocate weights (initializers) memory through the arena, as that complicates the shrinkage. Hence, step (1).
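A rough Python sketch of steps (1) and (3), assuming a recent onnxruntime-gpu build that exposes these string config entries; the model path, input name, and input shape are placeholders, and step (2) is omitted here because "initial_chunk_size_bytes" is set through the C/C++ arena config. Whether usage actually drops back to the baseline still depends on the build and on fragmentation.

import numpy as np
import onnxruntime as ort

# Step (1): keep initializers (weights) out of the arena so shrinking it is simpler.
so = ort.SessionOptions()
so.add_session_config_entry("session.use_device_allocator_for_initializers", "1")

sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CUDAExecutionProvider"])

# Step (3): request an arena shrink on GPU device 0 at the end of every Run(),
# de-allocating unused chunks beyond the initial one.
ro = ort.RunOptions()
ro.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

dummy_input = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder shape
outputs = sess.run(None, {"input": dummy_input}, run_options=ro)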

@Z-XQ
Author

Z-XQ commented Oct 26, 2021

> Thanks @Z-XQ for the explanation.
>
> The GPU memory is backed by a memory pool (arena), and we have a config knob to shrink the arena (de-allocate unused memory chunks).
>
> I'm not sure we have enough tools to accomplish this in Python just yet. The best way to use this feature in C++ is to:
>
>   1. Not allocate weights memory through the arena: see here.
>   2. Configure the arena with a high enough initial chunk to service most Run() calls: see "initial_chunk_size_bytes" here.
>   3. Finally, configure the arena to shrink on every Run(): see here. This keeps the initial chunk allocated but de-allocates any unused chunks remaining after the Run() call ends.
>
> For example, if the initial chunk size is set to 500MB, the first Run() will allocate 500MB plus any additional chunks required to service that Run() call. The additional chunks are de-allocated after Run(), leaving only the 500MB initial chunk allocated. It is important not to allocate weights (initializers) memory through the arena, as that complicates the shrinkage. Hence, step (1).

Thanks a lot! It's too hard to convert to a C++ deployment in a short time. I'll look for other ways to do this from Python.

@hariharans29
Member

Thanks. We will need to support configuring the arena in Python, so I will mark this as an enhancement.

hariharans29 added the feature request label Oct 26, 2021
@zlbdzhh

zlbdzhh commented May 11, 2022

When I followed the steps you described on a multi-GPU machine, I got this error: "Did not find an arena based allocator registered for device-id combination in the memory arena shrink list: gpu:1". How can I fix it?

@codender

> Thanks. We will need to support configuring the arena in Python, so I will mark this as an enhancement.

Hi, can I release GPU memory in Python now?

@chinmayjog13

Hi, any update on this issue?

@yingfenging

@hariharans29 Hi, can I release GPU memory in Python now?

@bluddy

bluddy commented Nov 6, 2023

I'd like to know if there is a way to limit the growth of memory usage, especially on the GPU. We're making serious use of onnxruntime and need to know whether we can rely on it in a Python-based system.
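One knob that is available from Python for capping GPU arena growth is the CUDA execution provider's provider options. The sketch below is only an illustration, assuming the installed onnxruntime-gpu build supports the documented "gpu_mem_limit" and "arena_extend_strategy" options (the model path and the 2 GB budget are placeholders); it caps how far the arena can grow rather than releasing memory that is already held.

import onnxruntime as ort

# Cap the CUDA memory arena at ~2 GB and grow it only by what each request needs,
# instead of the default power-of-two extension strategy.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,   # bytes; placeholder budget
    "arena_extend_strategy": "kSameAsRequested",
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)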

@dawenxi-only

Is there any update? I am also facing the issue of freeing memory from onnxruntime in Python.

@yiluzhuimeng

Is there any update? I am facing the issue of freeing memory from onnxruntime in Python, too.
