Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor run methods more into abstract method #4353

Merged
merged 25 commits into from
Dec 10, 2024

Conversation

merelcht
Copy link
Member

@merelcht merelcht commented Nov 26, 2024

Description

Resolves #4290

Development notes

  • Refactored and merged logic of the various _run method implementations into the abstract _run method.
  • Added abstract _get_executor method, implemented by each runner.
  • Refactored logic to determine number of workers from ParallelRunner and ThreadRunner into shared validate_max_workers method.
  • Changed hook_manager argument in runners to allow it to be None, which is needed for the ParallelRunner, because hook manager can't be serialised.
  • Updated TestSuggestResumeScenario for SequentialRunner, because it's now using a ThreadPoolExecutor, the suggestions can vary per run. I've manually created a project with the same pipelines and verified that the suggestions (even if they vary) do always work.
  • Added TestSuggestResumeScenario tests for ThreadRunner.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@merelcht merelcht self-assigned this Nov 26, 2024
merelcht and others added 2 commits November 27, 2024 11:22
Signed-off-by: Merel Theisen <[email protected]>
@merelcht merelcht marked this pull request as ready for review November 27, 2024 11:43
@merelcht merelcht requested review from idanov and noklam November 27, 2024 11:44
) -> ThreadPoolExecutor | ProcessPoolExecutor:
"""Abstract method to provide the correct executor (e.g., ThreadPoolExecutor or ProcessPoolExecutor)."""
pass

@abstractmethod # pragma: no cover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still an abstractmethod?

Copy link
Member Author

@merelcht merelcht Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because it's still necessary to have _run() for any Runner class and you can still overwrite this when creating a custom Runner. But now it's also easier to create a custom runner, because the _run() method has more in place to get started from.

kedro/runner/sequential_runner.py Outdated Show resolved Hide resolved
kedro/runner/runner.py Outdated Show resolved Hide resolved
)

self._release_datasets(node, catalog, load_counts, pipeline)
super()._run(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still needed if it's just using the base method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because _run() is still an abstract method.

Comment on lines 13 to 18
from collections import Counter, deque
from concurrent.futures import (
FIRST_COMPLETED,
ProcessPoolExecutor,
ThreadPoolExecutor,
wait,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this import multiprocessing as a dependencies? I recalled in the past we have issues with ShelveStore because even importing the library cause issues on restricted environment like AWS Lambda.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah I think it does.. Do you remember what the problem was with that?

@merelcht merelcht requested a review from DimedS December 2, 2024 11:20
Copy link
Member

@idanov idanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's amazing how similar now all the runners start to look like, I love it!

kedro/runner/runner.py Outdated Show resolved Hide resolved
kedro/runner/sequential_runner.py Outdated Show resolved Hide resolved
kedro/runner/runner.py Outdated Show resolved Hide resolved
Signed-off-by: Merel Theisen <[email protected]>
Copy link
Member

@idanov idanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, this makes all the runners look almost identical!

kedro/runner/thread_runner.py Outdated Show resolved Hide resolved
kedro/runner/sequential_runner.py Outdated Show resolved Hide resolved
kedro/runner/sequential_runner.py Outdated Show resolved Hide resolved
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR, @merelcht! Great job on unifying and simplifying the runner's code - it looks great to me! I’ve left one comment suggesting a potential complexity optimisation.


with self._get_executor(max_workers) as pool:
while True:
ready = {n for n in todo_nodes if node_dependencies[n] <= done_nodes}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose reducing the complexity of that line in a separate PR because this becomes more critical with the unification of runners. As I understand, the previous implementation of the SequentialRunner (most popular runner) simply executed an ordered list of nodes sequentially. However, the current approach performs a full loop over all rest nodes after each executed node, this results in an overall complexity greater than quadratic.

I believe we can achieve linear complexity by maintaining a count of unmet dependencies for each node. As nodes are completed, we decrement the counter for their dependents and mark a node as ready when its counter reaches zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @DimedS. I'll create a new issue for this, so it's not forgotten.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merelcht merelcht enabled auto-merge (squash) December 10, 2024 15:33
@merelcht merelcht merged commit f6ecca5 into main Dec 10, 2024
34 checks passed
@merelcht merelcht deleted the refactor/abstract-more-into-run-method branch December 10, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Abstract _run as much as possible
4 participants