[FEA] Add host memory task metrics and explore a host memory allocation watch dog. #8880

revans2 · 2023-07-31T18:25:05Z

Is your feature request related to a problem? Please describe.
This is a follow-on to #8879. We are probably not going to get all of the blocking code 100% right on the first attempt, especially if both GPU memory allocations and host memory allocations can block tasks. We should look at adding a watchdog of some kind that can detect when nothing has happened to a task in a long time and then try to break the potential deadlock with an exception. This part is experimental and might not work out, which is why we are going to explore it. As part of this we should also add metrics.
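To make the watchdog idea concrete, here is a minimal sketch of the bookkeeping it would need. It is written in Python only for brevity and is not the plugin's code: the names `AllocationWatchdog`, `StalledTaskError`, `record_progress`, `check`, and `raise_if_flagged` are all hypothetical, and it assumes a blocked task is woken periodically so it can see that it has been flagged.

```python
import threading
import time


class StalledTaskError(RuntimeError):
    """Hypothetical exception raised in a task the watchdog chose to unblock."""


class AllocationWatchdog:
    """Hypothetical watchdog that tracks when each task last made progress."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._last_progress = {}         # task_id -> time of last progress
        self._flagged = set()            # task_ids selected to be interrupted

    def record_progress(self, task_id):
        """Called whenever a task allocates, frees, or otherwise advances."""
        with self._lock:
            self._last_progress[task_id] = self._clock()
            self._flagged.discard(task_id)

    def task_done(self, task_id):
        """Stop tracking a task that has finished."""
        with self._lock:
            self._last_progress.pop(task_id, None)
            self._flagged.discard(task_id)

    def check(self):
        """Flag and return the tasks that have been idle longer than the timeout."""
        now = self._clock()
        with self._lock:
            stalled = [
                task_id
                for task_id, last in self._last_progress.items()
                if now - last > self._timeout_s
            ]
            self._flagged.update(stalled)
        return sorted(stalled)

    def raise_if_flagged(self, task_id):
        """Called by a blocked task when it wakes; breaks the wait with an exception."""
        with self._lock:
            flagged = task_id in self._flagged
        if flagged:
            raise StalledTaskError(f"task {task_id} made no progress within the timeout")
```

In this sketch a background thread would call `check()` on a timer, and the blocking allocation path would call `raise_if_flagged()` each time it wakes. Whether an exception is enough to break a real deadlock is exactly the open question this issue proposes to explore.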

We will need metrics on how much time tasks spend blocked and spilling. I think we want 5 new task-level metrics, and we will remove two that we currently have. Right now we have a metric for the amount of time spent spilling, but it is measured at the GPU spill entry point. Because we are adding a new entry point where a host allocation could also cause a spill, we want to split it up. We will have one metric for the amount of time a task was blocked on host memory allocation; one for the amount of time spent transferring data from the GPU to the host; one for the amount of time the task spent spilling from host memory to disk; one for the amount of time spent reading spilled data back to host memory; and one for the amount of time spent reading spilled data back to GPU memory.

Tasks

amount of time a task was blocked on host memory allocation
amount of time spent transferring data from the GPU to the host
amount of time the task spent spilling from host memory to disk
amount of time spilling from device to disk
amount of time spent reading spilled data back to host memory
amount of time spent reading spilled data back to GPU memory
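As an illustration of how the checklist above could map onto per-task counters, here is a minimal sketch with one accumulator per item and a timing helper to wrap each code path. It is written in Python for brevity and is not the plugin's implementation: the class `TaskSpillMetrics`, its field names, and the `timed` helper are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, fields


@dataclass
class TaskSpillMetrics:
    """Hypothetical per-task accumulators, in nanoseconds, one per checklist item."""

    host_alloc_blocked_ns: int = 0      # blocked on host memory allocation
    gpu_to_host_spill_ns: int = 0       # transferring data from the GPU to the host
    host_to_disk_spill_ns: int = 0      # spilling from host memory to disk
    device_to_disk_spill_ns: int = 0    # spilling from device to disk
    read_spill_to_host_ns: int = 0      # reading spilled data back to host memory
    read_spill_to_gpu_ns: int = 0       # reading spilled data back to GPU memory

    @contextmanager
    def timed(self, metric_name, clock=time.monotonic_ns):
        """Add the elapsed time of the wrapped block to the named metric."""
        if metric_name not in {f.name for f in fields(self)}:
            raise ValueError(f"unknown metric: {metric_name}")
        start = clock()
        try:
            yield
        finally:
            # Accumulate even if the wrapped block raises.
            elapsed = clock() - start
            setattr(self, metric_name, getattr(self, metric_name) + elapsed)
```

Usage would look like `with metrics.timed("host_alloc_blocked_ns"): ...` around the blocking call. Keeping each counter at its own entry point is what lets the existing single spill time be split up as described.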

revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Jul 31, 2023

revans2 mentioned this issue Jul 31, 2023

[FEA] Limit Host Memory Usage #8874

Open

30 tasks

revans2 added task Work required that improves the product but is not user facing reliability Features to improve reliability or bugs that severely impact the reliability of the plugin and removed feature request New feature or request labels Jul 31, 2023

mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 8, 2023

mattahrens assigned gerashegalov Sep 1, 2023

gerashegalov mentioned this issue Nov 8, 2023

Fine-grained spill metrics #9509

Merged

jlowe closed this as completed in #9509 Nov 14, 2023


revans2 commented Jul 31, 2023 • edited by gerashegalov
