Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in a layer of indirection for task completion callbacks [databricks] #9009

Merged
merged 5 commits into from
Aug 14, 2023

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Aug 11, 2023

This fixes #8482

It touches a lot of files because I wanted to move as much of the old API over to the new one so that I could properly test it. I think there may be other places where removing a callback might be nice too, but I didn't take that route here.

@revans2 revans2 self-assigned this Aug 11, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Aug 11, 2023

build

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Aug 11, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Aug 11, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Aug 14, 2023

build

private var wasCalled = false

override def removeCallback(): Unit = {
val topLevel = ScalableTaskCompletion.getTopLevel(tc.taskAttemptId())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call remove on the hashmap blindly, rather than get and remove?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is removing a callback from a task. We get the collection object that holds task callbacks and check to see if it (the collection object) is null. If it is not we then remove ourselves, the callback, from that collection. I guess each handle could hold onto a reference to the original collection associated with it instead of asking the singleton for it. Is that what you are asking?

* When the current task completes, if there is a current task call the given function.
* @return a handle that can be used to remove the callback.
*/
def onTaskCompletionIfNotTest(f: => Unit): TaskCompletionCallbackHandle = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little unclear on when I should use onTaskCompletionIfNotTest vs onTaskCompletion and it feels like having multiple choices is an area where we could introduce bugs. Is there any way we could change the logic if we know we are in the scala tests, but the calling code is unaware and has 1 choice to register?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onTaskCompletionIfNotTest inserts a callback if a TaskContext is available. If none is available, then it is not inserted. This was done by a lot of code for unit tests. The problem is that I don't want to encourage this behavior. The API itself is somewhat dangerous because we might or might not get a callback. I am fine with removing it and then forcing everywhere it is used to put in their own workaround.

@revans2
Copy link
Collaborator Author

revans2 commented Aug 14, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Aug 14, 2023

@abellina and @jlowe I think I have addressed all feedback. If you could take another look I'd appreciate it.

@revans2 revans2 merged commit efe3310 into NVIDIA:branch-23.10 Aug 14, 2023
27 checks passed
@revans2 revans2 deleted the task_callback_scaling branch August 14, 2023 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Potential leak on SplitAndRetry when iterator not fully drained
4 participants