[Performance] #13492

Open
gdw439 opened this issue Oct 28, 2022 · 1 comment
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

gdw439 commented Oct 28, 2022

Describe the issue

Keywords: GPU model, memory, C++

When I use ORT 1.12.1 on Linux with an NVIDIA T4, I run inference requests one at a time and then stop to watch the GPU memory. The GPU memory sits at some value A once the model is loaded and rises to a higher value B after running inference; however, even after running for a long time and then stopping, it never drops back to A. Is this expected? It looks as if the GPU memory is not being released. By the way, the model I use is float16.

To reproduce

As described above; a minimal sketch of the loop is shown below.
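A minimal sketch of the reproduction, assuming a single-input/single-output model and standard CUDA EP registration; the model path, tensor names, input shape, and use of `float` instead of float16 data are placeholders, not taken from the actual model:

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
  Ort::SessionOptions session_options;

  // Register the CUDA execution provider on device 0.
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  session_options.AppendExecutionProvider_CUDA(cuda_options);

  // GPU memory observed here is "A" (model weights loaded).
  Ort::Session session(env, "model.onnx", session_options);

  // Placeholder input; the real model is float16, float is used only to keep the sketch short.
  std::vector<int64_t> shape{1, 3, 224, 224};
  std::vector<float> input(1 * 3 * 224 * 224, 0.f);
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
      mem_info, input.data(), input.size(), shape.data(), shape.size());

  const char* input_names[] = {"input"};
  const char* output_names[] = {"output"};

  // Run requests one at a time; GPU memory grows to "B" during the loop
  // and stays at B after the loop stops.
  for (int i = 0; i < 1000; ++i) {
    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names,
                               &input_tensor, 1, output_names, 1);
  }
  return 0;
}
```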

Urgency

No response

Platform

Linux

OS Version

Ubuntu 14.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

C++

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

Model File

No response

Is this a quantized model?

Yes

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 28, 2022
@jywu-msft (Member) commented:

Memory is cached in an arena to avoid the overhead of allocating it again.
See #9509 (comment) for more details about how to influence this behavior.
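As a sketch of the knobs discussed there, assuming the 1.12 C++ API: the arena extend strategy and memory limit can be set on the CUDA provider options, and the arena can be asked to shrink after a run via a run-option config entry. The numeric values below are illustrative only, not recommendations:

```cpp
#include <onnxruntime_cxx_api.h>

// Configure the CUDA EP so the arena grows less aggressively.
Ort::SessionOptions MakeSessionOptions() {
  Ort::SessionOptions session_options;

  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // 1 = kSameAsRequested: extend the arena only by the requested size
  // instead of rounding up to the next power of two (default 0).
  cuda_options.arena_extend_strategy = 1;
  // Optional hard cap (in bytes) on how much GPU memory the arena may hold.
  cuda_options.gpu_mem_limit = 2ULL * 1024 * 1024 * 1024;
  session_options.AppendExecutionProvider_CUDA(cuda_options);
  return session_options;
}

// Ask the arena on GPU 0 to return unused chunks after a Run() call.
Ort::RunOptions MakeShrinkingRunOptions() {
  Ort::RunOptions run_options;
  run_options.AddConfigEntry("memory.enable_memory_arena_shrinkage", "gpu:0");
  return run_options;
}
```

Whether shrinkage actually returns memory to the driver depends on which arena regions are completely free after the run; the linked comment in #9509 covers the details.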
