
Fix implementation of processSQLPlanMetrics in Profiler #853

Merged — 3 commits merged into NVIDIA:dev on Mar 14, 2024

Conversation

amahussein (Collaborator) commented Mar 14, 2024

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #851

There are still more opportunities to improve performance, but I limited the scope of this fix to get the highest gains with minimal effort.

  • Changed the implementation of processSQLPlanMetrics to parameterize the information based on the SqlID
  • Added a buffer SqlPlanInfoGraphBuffer to capture the construction of the Spark SQL plan graph. Otherwise, the graph was constructed multiple times, which was inefficient
  • Changed the Javadoc in AutoTuner to clean up a build warning
  • Refactored the implementation of jobAndStageMetricsAggregation, which reduced both memory and CPU usage by 20%
  • Made a few changes in jobAndStageMetricsAggregation that reduced the total memory allocated by this method
  • Changed the type of local variables from Seq to Set
  • Removed some local variables that were causing memory bloat
  • Renamed GenerateDot.SparkPlanGraph to SparkPlanGraphForDot because it conflicted with Spark's SparkPlanGraph class in the imports
  • Fixed a unit test
  • Added a new Set to keep track of missing event classes, which reduces noise in the log files
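The buffering idea in the second bullet can be sketched as follows. This is a minimal illustration, not the tool's actual code: `PlanInfo`, `PlanGraph`, and `SqlPlanInfoGraphEntry` are hypothetical stand-ins for the real Spark plan types, and only the caching pattern mirrors the PR.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for Spark's plan-info and plan-graph types.
case class PlanInfo(sqlId: Long, desc: String)
case class PlanGraph(nodeCount: Int)

// Pair a plan with its (expensive to build) graph so later passes reuse
// the cached graph instead of reconstructing it for every analysis step.
case class SqlPlanInfoGraphEntry(planInfo: PlanInfo, sparkPlanGraph: PlanGraph)

class SqlPlanInfoGraphBuffer {
  val sqlPlanInfoGraphs = ArrayBuffer[SqlPlanInfoGraphEntry]()

  // Build the graph exactly once per plan and remember the result.
  def addSqlPlanInfoGraph(planInfo: PlanInfo,
      buildGraph: PlanInfo => PlanGraph): SqlPlanInfoGraphEntry = {
    val entry = SqlPlanInfoGraphEntry(planInfo, buildGraph(planInfo))
    sqlPlanInfoGraphs += entry
    entry
  }
}
```

Downstream loops then iterate over `sqlPlanInfoGraphs` and read the cached graph, rather than calling the graph constructor again for each pass.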

Overall Impact

After Changes: (profiler snapshot omitted)

Before Changes: (profiler snapshot omitted)

The before-changes snapshot (taken Mar 14, 2024) indicates the frequency of GC and the idleness of the CPU.

Does this affect the end user?

Yes.

  • The output related to "problematic issues" has changed after fixing the bug.
    • The generated output of sql_duration_and_executor_cpu_time_percent.csv is different after fixing the bug.
  • The runtime of the Profiler was improved.

Does this break the nightly builds?

Yes.
Some of the generated output files have changed.

Does this require additional followups?

There is still room to improve the efficiency of the core. Some of the TODOs:

  • Revisit the Qualification tool to apply the same techniques.
  • There is a lot of duplicated code and inefficiency in Analysis.scala, where we keep re-extracting the jobs/tasks in every function.
  • Narrow the scope of variables so that the GC can reclaim the memory once they go out of scope.
  • Improve the collection of System.properties in the profiler.

@amahussein added the labels bug (Something isn't working), core_tools (Scope the core module (scala)), and reliability on Mar 14, 2024
@amahussein amahussein self-assigned this Mar 14, 2024
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
val stagesInJob = app.stageIdToInfo.filterKeys { case (sid, _) =>
-  stageIdsInJob.contains(sid)
+  jc.stageIds.contains(sid)
}.keys.map(_._1).toSeq
amahussein (Author) commented:
The intermediate val stageIdsInJob = jc.stageIds was removed to save memory; jc.stageIds is now accessed directly.

-  stageIdsInJob.contains(sid)
-}.keys.map(_._1).toSeq
+  jc.stageIds.contains(sid)
+}.keys.map(_._1).toSet
amahussein (Author) commented:

Use toSet instead of toSeq because the collection is used mainly for lookups.

  }
}
-val missing = app.stageIdToInfo.keys.toSeq.diff(allStageinJobs.keys.toSeq)
+val missing = app.stageIdToInfo.keys.toSet.diff(allStageInJobs.keys.toSet)
amahussein (Author) commented:

Sets are more time-efficient than sequences for this operation.
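The missing-stage computation from the hunk above reduces to a set difference. A minimal sketch with made-up stage IDs (the real code diffs the keys of app.stageIdToInfo against the stages found in jobs): `Set.diff` uses hashed membership tests, whereas `Seq.diff` performs repeated linear searches.

```scala
// Made-up stand-ins for the tool's stage-ID keys.
val allStageIds: Set[Int] = Set(1, 2, 3, 4, 5)
val stagesInJobs: Set[Int] = Set(1, 2, 4)

// Stages known to the app but not attached to any job.
val missing = allStageIds.diff(stagesInJobs)
```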

@@ -231,12 +227,12 @@ class Analysis(apps: Seq[ApplicationInfo]) {
 val allRows = apps.flatMap { app =>
   app.sqlIdToInfo.map { case (sqlId, sqlCase) =>
     val jcs = app.jobIdToInfo.filter { case (_, jc) =>
-      jc.sqlID.getOrElse(-1) == sqlId
+      jc.sqlID.isDefined && jc.sqlID.get == sqlId
amahussein (Author) commented:

getOrElse means the VM allocates memory for the default value even when jc.sqlID is not defined, which can become very expensive over large structures.
Since this is a filter predicate, it is more memory-efficient to read the sqlID only when it is defined.
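The two predicate styles from the hunk can be compared side by side. This is an illustrative sketch with a made-up JobCase record, not the tool's types; it also shows Option.contains, the idiomatic one-call equivalent of the isDefined/get pair.

```scala
// Hypothetical stand-in for the tool's job record.
case class JobCase(sqlID: Option[Long])

val jobs = Seq(JobCase(Some(3L)), JobCase(None), JobCase(Some(7L)))
val sqlId = 3L

// Original predicate: materializes a default (-1L) for every None.
val viaGetOrElse = jobs.filter(jc => jc.sqlID.getOrElse(-1L) == sqlId)

// PR's predicate: touches the value only when the Option is defined.
val viaIsDefined = jobs.filter(jc => jc.sqlID.isDefined && jc.sqlID.get == sqlId)

// Idiomatic equivalent of the PR's predicate.
val viaContains = jobs.filter(jc => jc.sqlID.contains(sqlId))
```

All three select the same rows; the latter two simply skip any work on the None cases.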

-val allMetaWithSchema = getPlanMetaWithSchema(planInfo)
-val planGraph = ToolsPlanGraph(planInfo)
-val allNodes = planGraph.allNodes
+val allMetaWithSchema = getPlanMetaWithSchema(sqlPlanInfoGraph.planInfo)
amahussein (Author) commented:

Use the cached graph instead of reconstructing it.

 val allMetaWithSchema = getPlanMetaWithSchema(sqlPlanInfoGraph.planInfo)
-val allNodes = planGraph.allNodes
+val allNodes = sqlPlanInfoGraph.sparkPlanGraph.allNodes
 val results = ArrayBuffer[DataSourceCase]()
amahussein (Author) commented:

Cache the results locally and return them to the caller. The original code re-read the global data structure to extract what had just been calculated here.
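The pattern being described — accumulate results locally and hand them back, instead of writing to a global map and reading it back — can be sketched as below. The node representation and the findReads helper are simplified stand-ins, not the tool's actual code; only DataSourceCase is named after the real record.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the tool's datasource record.
case class DataSourceCase(sqlID: Long, format: String)

// Collect matches into a local buffer and return them directly, so the
// caller never has to go back to a shared structure for these results.
def findReads(sqlID: Long, nodeNames: Seq[String]): Seq[DataSourceCase] = {
  val results = ArrayBuffer[DataSourceCase]()
  nodeNames.foreach { n =>
    if (n.startsWith("Scan ")) {
      results += DataSourceCase(sqlID, n.stripPrefix("Scan "))
    }
  }
  results.toSeq // the caller receives exactly what was computed here
}
```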

}

// This will find scans for DataSource V2, if the schema is very large it
// will likely be incomplete and have ... at the end.
-protected def checkGraphNodeForReads(sqlID: Long, node: SparkPlanGraphNode): Unit = {
+protected def checkGraphNodeForReads(
+    sqlID: Long, node: SparkPlanGraphNode): Option[DataSourceCase] = {
amahussein (Author) commented:

Return the datasource to the caller instead of having the caller query the global data structure.
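The shape of the Unit-to-Option refactor can be sketched as follows. This is an illustrative reduction with a string node name in place of SparkPlanGraphNode; the point is that the check returns Some(result) or None, and the caller decides what to do with it, rather than both sides mutating shared state.

```scala
// Simplified stand-in for the tool's datasource record.
case class DataSourceCase(sqlID: Long, format: String)

// Option-returning check: no side effects, the result flows to the caller.
def checkNodeForReads(sqlID: Long, nodeName: String): Option[DataSourceCase] =
  if (nodeName.startsWith("Scan ")) {
    Some(DataSourceCase(sqlID, nodeName.stripPrefix("Scan ")))
  } else {
    None
  }
```

A caller can then collect the hits with a flatMap over the nodes, keeping all state local.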

 }
 connectOperatorToStage(createGraphFunc)
+for (sqlPIGEntry <- sqlPlanInfoBuffer.sqlPlanInfoGraphs) {
+  var sqlIsDsOrRDD = false
amahussein (Author) commented:

The local flag is used to avoid accessing the global sqlIDToDataSetOrRDDCase.

nartal1 (Collaborator) left a comment:

Thanks @amahussein! This helps reduce the runtime of the Profiler tool. Agreed that we can make some more improvements. Do you plan on creating follow-on issues for the TODOs mentioned in the description?

amahussein (Author) replied:

Thanks @nartal1!
I did not want to open more issues for now because it would become a swamp. In addition, the TODOs are fairly obvious whenever we need to pursue this.
There is an open umbrella issue #367 where we can append performance-related work items.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@amahussein amahussein merged commit a8ed8f3 into NVIDIA:dev Mar 14, 2024
13 checks passed
@amahussein amahussein deleted the spark-rapids-tools-851 branch March 14, 2024 20:04
Labels: bug (Something isn't working), core_tools (Scope the core module (scala)), reliability

Successfully merging this pull request may close: [BUG] Fix implementation of processSQLPlanMetrics in Profiler (#851)

2 participants