Improve Node Grouping in Kedro Deployment #4319

Open
2 tasks
DimedS opened this issue Nov 11, 2024 · 14 comments

@DimedS
Member

DimedS commented Nov 11, 2024

Overview

Part of #4317. Users have expressed the need to merge multiple Kedro nodes into a single task on deployment platforms for better clarity and efficiency. Current plugins offer limited support for this, often requiring manual grouping, which complicates deployment and reduces performance.

User Insights and Challenges

  • "Combining nodes into single tasks improves overview, but we currently have to manually group them in Databricks."
  • "We can convert a single node to a Kubeflow Component, but deploying 400 nodes as separate containers adds complexity."
  • "Running each Kedro node in a separate container could make a small node execute in one or two seconds, but Argo’s longer pod startup time would make this inefficient."

Problem Statement

How can we design a flexible and efficient node grouping mechanism - using tags, namespaces, pipelines, or other methods - to maximise usefulness for users and streamline the deployment process?

Proposed Solution

  • Centralised Grouping Functionality: Instead of developing node grouping features separately for each plugin, centralise this functionality within the Kedro framework. This approach would standardise and simplify node grouping, making it easier to implement and maintain across different deployment platforms.

Tech Design - 4th & 6th December

It was decided to use namespaces for node grouping purposes and to implement helper functions within Kedro to simplify the deployment of grouped nodes with namespaces. Full summary of TD

Next steps:

@merelcht merelcht added this to the Deployment milestone Nov 11, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Nov 11, 2024
@datajoely
Contributor

This is the most important problem for me. It's also tightly coupled with dependency management: the minute we make it easier to isolate different parts of the pipeline to run in different containers, you get into dependency isolation questions.

@datajoely
Contributor

Users today also tend towards tags because namespaces are a pain to use

@DimedS
Member Author

DimedS commented Dec 9, 2024

Summary of Tech Design Sessions on Node Grouping (4, 6 Dec 2024)

Video

Discussed:

  • Node Grouping Methods: Reviewed different methods for grouping nodes in Kedro, including tags, pipelines, and namespaces. Discussed the pros and cons of each approach and how they affect deployment when converting Kedro pipelines to various platforms.
  • Deployment Options: Considered two options for deploying grouped nodes:
    1. Helper functions within Kedro combined with modifications to deployment plugins.
    2. Using synthetic nodes.

Decisions:

  1. Adopting Namespaces for Deployment:
    • Reasoning:
      Despite observed challenges with namespaces (e.g., limited adoption and UI complexities), namespaces are considered the right tool for this use case. They offer value to users and should be more widely utilized. Enhancing deployment functionality based on namespaces, alongside improving the UI, will likely boost their adoption.
  2. Choosing Helper Functions + Plugins Over Synthetic Nodes:
    • Reasoning:
      Synthetic nodes are technically complex, less useful for users, and harder to implement. Instead, focusing on helper functions and plugin modifications aligns better with Kedro’s approach and encourages collaboration with plugin maintainers to ensure proper maintenance.

Next Steps:

  1. Prototype Development:
    • Develop a deployment solution prototype using helper functions in Kedro, tested with the kedro-airflow plugin to enable namespace deployment to Airflow. Results will be shared in the Tech Design session, approximately in January.
  2. Namespace UI Improvements:
  3. Engagement with Plugin Developers:
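As a rough illustration of what a helper-function approach could give a plugin such as kedro-airflow, the sketch below derives dependencies between namespace groups from the datasets that nodes read and write. It is only a sketch under stated assumptions: `Node` is a hypothetical stand-in rather than the real Kedro class, and `group_dependencies` is an invented name, not an existing Kedro or kedro-airflow function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Kedro node (not kedro.pipeline.node.Node)."""
    name: str
    inputs: tuple
    outputs: tuple
    namespace: str


def group_dependencies(nodes):
    """Return {group: set of upstream groups}, inferred from dataset names."""
    # Which group produces each dataset.
    producer = {ds: n.namespace for n in nodes for ds in n.outputs}
    deps = {n.namespace: set() for n in nodes}
    for n in nodes:
        for ds in n.inputs:
            upstream = producer.get(ds)
            # Ignore external inputs and edges that stay inside one group.
            if upstream is not None and upstream != n.namespace:
                deps[n.namespace].add(upstream)
    return deps


nodes = [
    Node("clean", ("raw",), ("cleaned",), "data_processing"),
    Node("join", ("cleaned",), ("features",), "data_processing"),
    Node("train", ("features",), ("model",), "modeling"),
    Node("report", ("model", "features"), ("metrics",), "reporting"),
]
deps = group_dependencies(nodes)
```

A plugin could then create one orchestrator task per key of `deps` and wire the task edges from the values, so that edges internal to a group never surface in the orchestrator.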

Additional Information:

Feel free to add your thoughts or suggestions here. If there’s anything to update or clarify in the summary, please let me know.
A link to the video recordings of both sessions will be published soon.

@DimedS
Member Author

DimedS commented Dec 9, 2024

from @marrrcin:
I was not able to attend the follow-up :sadcat2:
For plugin development, a functionality that would be super helpful is a utility function in the official Kedro API that accepts:

  • pipeline
  • grouping criterion

and returns a list of nodes grouped by that criterion.
I don't know if that's how you plan to expose that :thinking_face:
By iterating over the results of such a grouped pipeline, plugins will be able to easily do the M:N mapping instead of 1:1. Then plugins will be able to create orchestrator-specific pipelines which will invoke something like kedro run --from-nodes --to-nodes .
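A minimal sketch of the kind of utility described above, assuming a simplified node model. `Node`, `group_nodes`, and `by_top_level_namespace` are hypothetical names chosen for illustration; none of them is part of the Kedro API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Kedro node (not kedro.pipeline.node.Node)."""
    name: str
    namespace: Optional[str] = None


def group_nodes(nodes, criterion: Callable[[Node], str]) -> dict:
    """Map each group key to the node names in that group, keeping input order."""
    groups = defaultdict(list)
    for n in nodes:
        groups[criterion(n)].append(n.name)
    return dict(groups)


def by_top_level_namespace(n: Node) -> str:
    # A node without a namespace becomes its own single-node group.
    return n.namespace.split(".")[0] if n.namespace else n.name


nodes = [
    Node("load_raw"),
    Node("clean", namespace="data_processing"),
    Node("join", namespace="data_processing"),
    Node("train", namespace="modeling.variant_1"),
]
groups = group_nodes(nodes, by_top_level_namespace)
```

Passing the criterion as a callable keeps the helper agnostic about whether groups come from namespaces, tags, or something else, which is the M:N mapping the comment asks for.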

@astrojuanlu
Member

Despite observed challenges with namespaces (e.g., limited adoption and UI complexities), namespaces are considered the right tool for this use case.

I value the desire to keep momentum and to devote more time to understanding what namespaces do and how they work, so that at least we can discuss them more intelligently.

But going forward I think there's an opportunity to explore innovative new solutions, or just make namespaces an implementation detail so that they continue to exist but they become invisible for the user and get swept under a more usable API layer. I know I sound like a broken record but I'll say it again: more documentation will not fix bad Developer Experience.

In short: agree to continue exploring them (on the grounds of keeping the momentum on this topic and not having to do another knowledge sharing session in ~12 months), but I think we should timebox this effort, and put a deadline on when we think we're ready to go back to the drawing board and continue iterating as a team.

@datajoely

This comment was marked as off-topic.

@marrrcin

This comment was marked as off-topic.

@astrojuanlu

This comment was marked as off-topic.

@datajoely
Contributor

I think namespaces are critical to any way we eventually unify and simplify deployment to orchestrators. I'm arguing dependency isolation is basically the same problem, but would like to see if others agree?

@deepyaman

This comment was marked as off-topic.

@Galileo-Galilei

This comment was marked as off-topic.

@marrrcin
Contributor

marrrcin commented Dec 10, 2024

I agree with the need for dependency isolation for large projects 👍🏻

For grouping by the namespace - although it's fine to have anything to group on, namespaces were (at least for me) used to group larger chunks of the pipeline, e.g.:

<ns1 = data processing part> --> <ns2 = modeling variant 1>
                             --> <ns2 = modeling variant 2>

etc. (like in the dynamic pipelines https://getindata.com/blog/kedro-dynamic-pipelines/ ).

So grouping by the namespace will actually require thinking about the target deployment as soon as you start writing the pipeline, which means it will impact the data catalog/parameters creation too (because of namespace prefixes) = more cognitive load.

Plus, projects might end up having a lot of "synthetic" namespaces, just for the sake of preparing the pipeline for an orchestrator.
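To make the catalog/parameters impact concrete, here is a small sketch that approximates how a namespace prefix changes dataset and parameter names. It is an illustration only: `apply_namespace` is a hypothetical function, and the prefixing rules are an assumption about Kedro's behaviour that should be checked against the installed version.

```python
def apply_namespace(name: str, namespace: str, shared: frozenset = frozenset()) -> str:
    """Prefix a dataset or parameter name unless it is explicitly shared."""
    if name in shared:
        # Datasets mapped as pipeline inputs/outputs keep their original name.
        return name
    if name.startswith("params:"):
        # Assumed convention: the prefix goes after the "params:" marker.
        return f"params:{namespace}.{name[len('params:'):]}"
    return f"{namespace}.{name}"


names = ["features", "params:learning_rate", "model"]
renamed = [
    apply_namespace(n, "modeling_v1", shared=frozenset({"features"}))
    for n in names
]
```

Every renamed entry needs a matching catalog or parameters key, which is why choosing namespaces for deployment reasons feeds back into how the catalog is written.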

@astrojuanlu
Member

Copy-pasting some comments about dependency isolation to #4147 and collapsing them here

@deepyaman
Member

Copy-pasting some comments about dependency isolation to #4147 and collapsing them here

OK, but I'm going to just reiterate my points specifically related to node grouping and deployment, so that they don't get skipped/misconstrued as being only relevant for dependency management:

  • Supporting complex node grouping is unnecessary, at least to start with; that's akin to the micro-packaging over-engineering problem. The vast majority of users would benefit from being able to deploy each modular pipeline (with its own namespace) separately. This also creates pretty clear boundaries where node persistence, etc. may be required.

TL;DR I agree with the idea of namespaces and supporting deployment based on namespaces, but I further posit that namespaces just being per modular pipeline is sufficient, even for deployment, and that will also make them easier to adopt.
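Under that one-namespace-per-modular-pipeline view, a deployment plugin might only need to emit one run command per namespace. The sketch below assumes the `--namespace` option of `kedro run`; the exact flag name can differ between Kedro versions, and `run_commands` is a hypothetical helper, not an existing function.

```python
import shlex


def run_commands(namespaces):
    """Build one `kedro run` command line per namespace (flag name is an assumption)."""
    return [shlex.join(["kedro", "run", "--namespace", ns]) for ns in namespaces]


commands = run_commands(["data_processing", "modeling"])
```

Each string in `commands` could become the entry point of one container or orchestrator task.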

Projects
Status: In Progress
Development

No branches or pull requests

8 participants