Qualification tool - Handle cancelled jobs and stages better and don't skip the app #1033

tgravescs · 2024-05-23T15:40:31Z

I ran an event log through the qualification tool and it was labelled as not applicable because it had failed stages. Those failed stages, though, had been cancelled by AQE runs.

We should take this into account in the qual tool.

The reasons in task show up as: Stage cancelled...
The stage failure reason shows: Job 243 cancelled

tool output:
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 47 had failures, skipping
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 125 had failures, skipping

This PR fixes that by looking for "cancelled" in the failure messages and not counting those as failures.
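A minimal sketch of that idea, not the actual code in this PR: the object name `FailureFilter` and both method names are hypothetical, and the only assumption taken from the messages quoted above is that a cancellation shows up as the word "cancelled" in the failure reason.

```scala
// Hypothetical helper for illustration only; names are not from the PR.
object FailureFilter {
  // True when the failure reason looks like a cancellation, e.g.
  // "Stage cancelled..." or "Job 243 cancelled".
  def isCancelled(reason: String): Boolean =
    reason.toLowerCase.contains("cancelled")

  // A stage or job counts as a real failure only when it has a failure
  // reason and that reason is not a cancellation.
  def isRealFailure(failureReason: Option[String]): Boolean =
    failureReason.exists(r => !isCancelled(r))
}
```

With this, a reason such as "Job 243 cancelled" is not treated as a failure, while any other failure reason still is.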

I tested this on a customer event log and it is working. We need to put that event log into our integration tests.

Signed-off-by: Thomas Graves <[email protected]>

amahussein

Thanks @tgravescs for putting up the fix.
While refactoring the tools, I found that the profiler was treating failed jobs in a different way.

Shall we fix the two code blocks to get the Q/P consistent? Or would you prefer to file a separate issue?

spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/store/StageModel.scala

Lines 61 to 63 in a6fdc86

    
    def hasFailed: Boolean = {
      sInfo.failureReason.isDefined
    }

For failed tasks, should we check for event.taskInfo.taskKilled in

spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/store/TaskModel.scala

Line 88 in a6fdc86

event.taskInfo.successful,

This is the code that loops over failed tasks.

spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/store/TaskModelManager.scala

Lines 98 to 101 in a6fdc86

    
    // Return a list of tasks that failed within all the stageAttempts
    def getAllFailedTasks: Iterable[TaskModel] = {
      getAllTasks(Some(!_.successful))
    }
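To illustrate the question about killed tasks, here is a rough, self-contained sketch. `TaskSketch` is a simplified stand-in for TaskModel, and its `killed` field, along with the `TaskFilters` object and its method names, are assumptions made for this example rather than anything in the repository.

```scala
// Simplified stand-in for TaskModel; the `killed` field is an assumption.
final case class TaskSketch(taskId: Long, successful: Boolean, killed: Boolean)

// Hypothetical filters for illustration only.
object TaskFilters {
  // Mirrors the existing behavior: every task that is not successful
  // is reported as failed, including killed ones.
  def allFailed(tasks: Iterable[TaskSketch]): Iterable[TaskSketch] =
    tasks.filter(!_.successful)

  // Variant that leaves killed tasks out of the failed set.
  def failedNotKilled(tasks: Iterable[TaskSketch]): Iterable[TaskSketch] =
    tasks.filter(t => !t.successful && !t.killed)
}
```

The difference between the two filters is exactly the set of killed tasks, which is where the qualification and profiling tools could end up disagreeing.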

tgravescs · 2024-05-24T15:33:00Z

They should be separate. When I looked briefly at the profiling tool, I saw that it's outputting failed jobs to files. We still want to do that, as that is how Spark is showing them. I didn't look at all the rollups, though, to see where they are affected. Again, that is a separate issue, which I don't think is as important.

amahussein

LGTM

tgravescs added 2 commits May 23, 2024 10:31

Handle cancelled jobs and stages better and don't skip the app

8795db4

Signed-off-by: Thomas Graves <[email protected]>

add error for unknown job type

ac8bc7a

tgravescs added the bug Something isn't working label May 23, 2024

tgravescs self-assigned this May 23, 2024

amahussein reviewed May 24, 2024


amahussein added the core_tools Scope the core module (scala) label May 24, 2024

amahussein approved these changes May 24, 2024


tgravescs merged commit 89fbf83 into NVIDIA:dev May 24, 2024
16 checks passed

tgravescs deleted the handleCancelled branch May 24, 2024 16:16
