[FEA] Qualification tool should look at spill metrics #477
Comments
To add some context, I wish the Qualification tool could consider both of the factors below:
For example, Standard_NC8as_T4_v3 has 240 MB/s combined read/write throughput for each local disk per Spark executor. If the cluster shape information is provided as input to the Qualification tool, it should find a way to measure the negative impact based on the two factors above.
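As a rough illustration of how the disk throughput figure could feed into an estimate, here is a minimal Python sketch. It is not the tool's implementation. The function name `estimate_spill_io_seconds` and its parameters are hypothetical, and it assumes that every spilled byte is written once and read back once, that 1 MB means 10^6 bytes, and that the 240 MB/s budget is shared between reads and writes.

```python
def estimate_spill_io_seconds(disk_bytes_spilled: int,
                              disk_throughput_mb_per_s: float = 240.0) -> float:
    """Rough lower bound on the time an executor spends on spill I/O.

    Assumptions (not from the issue): each spilled byte is written once and
    read back once, and reads and writes share one combined throughput budget.
    """
    if disk_throughput_mb_per_s <= 0:
        raise ValueError("disk throughput must be positive")
    bytes_per_second = disk_throughput_mb_per_s * 1_000_000  # 1 MB = 10^6 bytes
    total_io_bytes = 2 * disk_bytes_spilled  # one write plus one read
    return total_io_bytes / bytes_per_second


# 12 GB spilled on a 240 MB/s local disk
print(estimate_spill_io_seconds(12_000_000_000))  # 100.0
```

The result is only a floor on spill cost. It ignores serialization, compression, and contention from shuffle traffic on the same disk.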
The problem is that just because the CPU spills doesn't mean the GPU is going to spill, and vice versa: the CPU might not spill but the GPU does. A lot of factors go into this, and it is a very hard problem. Maybe we can use a heuristic such as "if spill on the CPU is over a certain threshold, the GPU will likely spill too", or we need to look at other characteristics/metrics of the job and the cluster, but that definitely gets complex.
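The threshold heuristic mentioned above could look something like the following Python sketch. Everything here is hypothetical: the `StageSpill` record, the `flag_likely_gpu_spill` name, and the 1 GiB default threshold are placeholders for illustration, not values or APIs from the tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StageSpill:
    """Hypothetical per-stage summary of CPU spill taken from an event log."""
    stage_id: int
    disk_bytes_spilled: int


def flag_likely_gpu_spill(stages: Iterable[StageSpill],
                          threshold_bytes: int = 1 << 30) -> List[int]:
    """Return IDs of stages whose CPU disk spill exceeds the threshold.

    The 1 GiB default is an arbitrary placeholder; a real threshold would
    need to be tuned against observed GPU runs.
    """
    return [s.stage_id for s in stages if s.disk_bytes_spilled > threshold_bytes]


stages = [
    StageSpill(stage_id=1, disk_bytes_spilled=0),
    StageSpill(stage_id=2, disk_bytes_spilled=5 * (1 << 30)),
    StageSpill(stage_id=3, disk_bytes_spilled=200 * (1 << 20)),
]
print(flag_likely_gpu_spill(stages))  # [2]
```

A flag like this only captures one direction of the problem. It says nothing about stages that do not spill on the CPU but would on the GPU.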
We discussed offline the initial steps to tackle this issue:
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#477

This code change aims at bringing the Q/P tools' handling of stages and their accumulators to a common ground. It includes a couple of fixes:
- Capture accumulator IDs of a stage during a stage completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which inefficiently iterated over the tasks list multiple times.
- Remove the redundant data structure that maps between accumulators and stages.
* Refactor Stage info code between Q/P tools

Contributes to #477

This code change aims at bringing the Q/P tools' handling of stages and their accumulators to a common ground. It includes a couple of fixes:
- Capture accumulator IDs of a stage during a stage completion event.
- Fix the construction of MLFunctions.
- Fix the implementation of `jobAndStageMetricsAggregation`, which inefficiently iterated over the tasks list multiple times.
- Remove the redundant data structure that maps between accumulators and stages.

* Remove unused class StageInfoClass

* Move StageModelManager to a separate scala class file

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Is your feature request related to a problem? Please describe.
Spill can be very time-consuming; the Qualification tool should try to take spill metrics into account.
Note, originally mentioned in #73
Tasks