Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qualification tool - Handle cancelled jobs and stages better and don't skip the app #1033

Merged
merged 2 commits into from
May 24, 2024

Conversation

tgravescs
Copy link
Collaborator

@tgravescs tgravescs commented May 23, 2024

Fixes #1032

I ran an event log through the qualification tool and it got labelled as not applicable because it had failed stages. Those failed stages though were cancelled by AQE runs.

We should take this into account in the qual tool.

The reasons in task show up as: Stage cancelled...
The stage failure reason shows: Job 243 cancelled

tool output:
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 47 had failures, skipping
24/05/23 10:00:26 WARN QualificationEventProcessor: SQL execution id 125 had failures, skipping

This PR fixes that by looking for cancelled in the failure messages ignores those as failures.

I tested on customer event log and this is working. Need to put that event log into our integration tests.

@tgravescs tgravescs added the bug Something isn't working label May 23, 2024
@tgravescs tgravescs self-assigned this May 23, 2024
Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tgravescs for putting the fix.
During refactoring the tools, I found that the profiler was adding treating failed jobs in a different way.

Shall we fix the two cod blocks to get the Q/P consistent? Or you prefer to file a separate issue?

def hasFailed: Boolean = {
sInfo.failureReason.isDefined
}

For task failed, should we check for event.taskInfo.taskKilled in

this is the code that loops on failed tasks.

// Return a list of tasks that failed within all the stageAttempts
def getAllFailedTasks: Iterable[TaskModel] = {
getAllTasks(Some(!_.successful))
}

@amahussein amahussein added the core_tools Scope the core module (scala) label May 24, 2024
@tgravescs
Copy link
Collaborator Author

they should be separate. when I looked briefly at the profiling tool, I know its outputting failed jobs to files. We still want to do that as that is how Spark is showing them. I didn't look at all the rollups though to see where it they are affected. Again a separate issue which I don't think is as important.

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTME

@tgravescs tgravescs merged commit 89fbf83 into NVIDIA:dev May 24, 2024
16 checks passed
@tgravescs tgravescs deleted the handleCancelled branch May 24, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core_tools Scope the core module (scala)
Projects
None yet
2 participants