[Performance] #13492

Open
gdw439 opened this issue Oct 28, 2022 · 1 comment
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

gdw439 commented Oct 28, 2022

Describe the issue

Keywords: GPU model, memory, C++

When I use ORT 1.12.1 on Linux with an NVIDIA T4, I run inference requests one at a time and then stop to watch the GPU memory. The GPU memory sits at some value A once the model is loaded and rises to a higher value B after running inference; however, even after running for a long time and then stopping, it never drops back to A. Is this expected? It looks as if the GPU memory is not being released. By the way, the model I use is float16.

To reproduce

As described above; a minimal sketch of the loop is shown below.
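A minimal sketch of the reproduction, assuming a single-input/single-output model and standard CUDA EP registration; the model path, tensor names, input shape, and use of `float` instead of float16 data are placeholders, not taken from the actual model:

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
  Ort::SessionOptions session_options;

  // Register the CUDA execution provider on device 0.
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  session_options.AppendExecutionProvider_CUDA(cuda_options);

  // GPU memory observed here is "A" (model weights loaded).
  Ort::Session session(env, "model.onnx", session_options);

  // Placeholder input; the real model is float16, float is used only to keep the sketch short.
  std::vector<int64_t> shape{1, 3, 224, 224};
  std::vector<float> input(1 * 3 * 224 * 224, 0.f);
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
      mem_info, input.data(), input.size(), shape.data(), shape.size());

  const char* input_names[] = {"input"};
  const char* output_names[] = {"output"};

  // Run requests one at a time; GPU memory grows to "B" during the loop
  // and stays at B after the loop stops.
  for (int i = 0; i < 1000; ++i) {
    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names,
                               &input_tensor, 1, output_names, 1);
  }
  return 0;
}
```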

Urgency

No response

Platform

Linux

OS Version

Ubuntu 14.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

C++

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

Model File

No response

Is this a quantized model?

Yes

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 28, 2022
@jywu-msft (Member) commented:

Memory is cached in an arena to avoid the overhead of allocating it again.
See #9509 (comment) for more details about how to influence this behavior.
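As a sketch of the knobs discussed there, assuming the 1.12 C++ API: the arena extend strategy and memory limit can be set on the CUDA provider options, and the arena can be asked to shrink after a run via a run-option config entry. The numeric values below are illustrative only, not recommendations:

```cpp
#include <onnxruntime_cxx_api.h>

// Configure the CUDA EP so the arena grows less aggressively.
Ort::SessionOptions MakeSessionOptions() {
  Ort::SessionOptions session_options;

  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // 1 = kSameAsRequested: extend the arena only by the requested size
  // instead of rounding up to the next power of two (default 0).
  cuda_options.arena_extend_strategy = 1;
  // Optional hard cap (in bytes) on how much GPU memory the arena may hold.
  cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;
  session_options.AppendExecutionProvider_CUDA(cuda_options);
  return session_options;
}

// Ask the arena on GPU 0 to return unused chunks after a Run() call.
Ort::RunOptions MakeShrinkingRunOptions() {
  Ort::RunOptions run_options;
  run_options.AddConfigEntry("memory.enable_memory_arena_shrinkage", "gpu:0");
  return run_options;
}
```

Whether shrinkage actually returns memory to the driver depends on which arena regions are completely free after the run; the linked comment in #9509 covers the details.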
