[FEA] Add host memory task metrics and explore a host memory allocation watch dog. #8880
Closed
6 tasks done
Labels
reliability
Features to improve reliability or bugs that severely impact the reliability of the plugin
task
Work required that improves the product but is not user facing
Is your feature request related to a problem? Please describe.
This is a follow-on to #8879. We are probably not going to get all of the blocking code 100% right on the first attempt, especially if both GPU memory allocations and host memory allocations can block tasks. We should look at adding a watchdog of some kind that can detect when nothing has happened to a task in a long time and then try to break the potential deadlock with an exception. This part is experimental and might not work out, which is why we are only going to explore it. As a part of this work we should also add metrics.
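To make the watchdog idea concrete, here is a minimal sketch in Python, not the plugin's actual design. It assumes each task reports progress to a shared tracker, a periodic scan marks tasks that have been idle longer than a timeout, and a blocked task checks for that mark whenever it wakes up. The names `AllocationWatchdog`, `PotentialDeadlockError`, and `stall_timeout_s` are hypothetical and only for illustration.

```python
import threading
import time


class PotentialDeadlockError(RuntimeError):
    """Hypothetical exception raised in a task the watchdog marked as stalled."""


class AllocationWatchdog:
    """Tracks the last progress time of each task and flags the stalled ones."""

    def __init__(self, stall_timeout_s, clock=time.monotonic):
        self._timeout = stall_timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._lock = threading.Lock()
        self._last_progress = {}  # task_id -> time of last recorded progress
        self._to_abort = set()    # task_ids marked by the most recent scans

    def record_progress(self, task_id):
        """Called by a task whenever it allocates, spills, or otherwise moves on."""
        with self._lock:
            self._last_progress[task_id] = self._clock()

    def task_done(self, task_id):
        """Forget a task once it has finished."""
        with self._lock:
            self._last_progress.pop(task_id, None)
            self._to_abort.discard(task_id)

    def scan(self):
        """Mark every task idle for longer than the timeout and return their ids."""
        now = self._clock()
        with self._lock:
            stalled = [t for t, ts in self._last_progress.items()
                       if now - ts > self._timeout]
            self._to_abort.update(stalled)
        return sorted(stalled)

    def check(self, task_id):
        """Called by a blocked task when it wakes up; raises if it was marked."""
        with self._lock:
            marked = task_id in self._to_abort
            self._to_abort.discard(task_id)
        if marked:
            raise PotentialDeadlockError(
                f"task {task_id} made no progress for over {self._timeout}s")
```

In this sketch a background thread would call `scan()` on a fixed interval, and the code that blocks on an allocation would wait with a timeout and call `check()` after each wake-up. Raising the exception in the blocked task itself, rather than interrupting it from outside, is one possible choice; whether that is enough to break a real deadlock is exactly the experimental part.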
We will need metrics for the time tasks spend blocked and the time they spend spilling. I think we want five new task-level metrics, and we will remove two that we currently have. Right now we have a metric for the amount of time spent spilling, but it is measured at the GPU spill entry point. Because we are adding a new entry point where a host allocation could also cause a spill, we want to split it up. We will have one metric for the amount of time a task was blocked on host memory allocation; another for the amount of time spent transferring data from the GPU to the host; one for the amount of time the task spent spilling from host memory to disk; one for the amount of time spent reading spilled data back to host memory; and finally one for the amount of time spent reading spilled data back to GPU memory.
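The five metrics described above could be accumulated per task roughly as follows. This is a sketch in Python under stated assumptions, not the plugin's metric implementation: the class name `TaskMemoryMetrics`, the field names, and the nanosecond unit are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, fields


@dataclass
class TaskMemoryMetrics:
    """Hypothetical per-task accumulators, one per proposed metric (nanoseconds)."""
    host_alloc_blocked_ns: int = 0   # time blocked on host memory allocation
    gpu_to_host_spill_ns: int = 0    # time transferring data from GPU to host
    host_to_disk_spill_ns: int = 0   # time spilling from host memory to disk
    read_spill_to_host_ns: int = 0   # time reading spilled data back to host
    read_spill_to_gpu_ns: int = 0    # time reading spilled data back to GPU

    @contextmanager
    def timed(self, metric_name, clock=time.monotonic_ns):
        """Add the elapsed time of the wrapped block to the named metric."""
        if metric_name not in {f.name for f in fields(self)}:
            raise KeyError(metric_name)
        start = clock()
        try:
            yield
        finally:
            # Record the time even if the wrapped block raises.
            elapsed = clock() - start
            setattr(self, metric_name, getattr(self, metric_name) + elapsed)
```

A caller would wrap each phase at its own entry point, for example `with metrics.timed("host_to_disk_spill_ns"): ...`, so that a spill triggered by a host allocation and one triggered by a GPU allocation are charged to the same phase-specific metric.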
Tasks