Provide a lightweight solution to speed up session reload or create new session #2879

Open
noklam opened this issue Aug 1, 2023 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Contributor

noklam commented Aug 1, 2023

Quotes

Carlos Barreto
We are using Kedro as part of an event stream + Amazon ECS solution. What they want to check is whether there is a way to always have the Kedro context up and running, with an API call to execute the pipeline only when necessary. I was thinking that this is possible by programmatically generating the KedroContext, making it a global service, and only using specific pipeline calls. But I don’t know if we have any similar use cases implemented already, and I wanted to get some opinions on it. Today, we run something like a kedro run inside the container every time, and this ends up spending important warm-up seconds loading the context/dependencies into memory.
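The "keep it warm, run on demand" idea in the quote can be sketched without any Kedro API. Everything below is a hypothetical stand-in: load_project() plays the role of the expensive warm-up, and handle_event() plays the role of the per-request pipeline call. Which Kedro calls these would correspond to in a real service is left open here.

```python
from functools import lru_cache
from itertools import count

# Hypothetical stand-ins only; no Kedro API is used in this sketch.
_load_counter = count(1)
_session_counter = count(1)


@lru_cache(maxsize=1)
def load_project() -> dict:
    """Stand-in for the expensive warm-up (imports, settings, pipeline registry)."""
    return {
        "load_number": next(_load_counter),
        "pipelines": {"ingest": ["clean", "store"], "train": ["fit", "evaluate"]},
    }


def handle_event(pipeline_name: str) -> dict:
    """Handle one incoming event, reusing the already warmed-up project."""
    project = load_project()  # cached after the first call
    session_id = f"session-{next(_session_counter)}"  # fresh id for every run
    nodes = project["pipelines"][pipeline_name]
    return {
        "session_id": session_id,
        "ran": list(nodes),
        "load_number": project["load_number"],
    }


first = handle_event("ingest")
second = handle_event("train")
print(first)
print(second)
```

Both calls report the same load_number, so the warm-up happened once, while each call got its own session id.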

Description

I do a lot of development work in IPython or Jupyter, and I often want to make a small change and test whether it works. %reload_kedro can be quite slow, and the development experience is frustrating because every change requires a reload.

This is also potentially related to #1853, #2134, #2182.

kedro ipython takes > 20s to start and %reload_kedro takes

Context

After this PR, a session can only be run once. The easiest way to create a new session is %reload_kedro. While %reload_kedro works, it is considerably slow on big projects for a few reasons:

  • It recreates everything: session, context, pipelines, catalog.
  • If certain datasets exist, it will even re-establish the connection to the database, which is slow (see Lazy Loading of Catalog Items #2829).
  • All the plugin hooks are registered again, as is evident from the log message:

INFO Registered line magic 'run_viz' __init__.py:115

What's the minimal effort to recreate session?

If we look into the code, there is a self._run_called attribute, and every time we call session.run it checks whether that attribute is True.

# Excerpt: a successful run sets the flag (the rest of the try statement is not shown).
try:
    run_result = runner.run(
        filtered_pipeline, catalog, hook_manager, session_id
    )
    self._run_called = True

# Excerpt: the guard that rejects a second run on the same session.
if self._run_called:
    raise KedroSessionError(
        "A run has already been completed as part of the"
        " active KedroSession. KedroSession has a 1-1 mapping with"
        " runs, and thus only one run should be executed per session."
    )

Why do we need this check? Mainly because session_id needs to be a unique value; otherwise it can cause errors in experiment tracking (kedro-viz), which relies on it as a unique id. If we simply override session._run_called = False and call session.run(), almost everything will work.
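A minimal sketch of the guard and the override described above, using a toy class rather than the real KedroSession. ToySession and ToySessionError are hypothetical stand-ins, and the session id value is made up for illustration. The point it shows is that resetting the flag allows a second run, but that run reuses the same session_id.

```python
class ToySessionError(Exception):
    """Hypothetical stand-in for KedroSessionError."""


class ToySession:
    """Toy model of a session that may only be run once."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self._run_called = False

    def run(self) -> str:
        if self._run_called:
            raise ToySessionError("only one run should be executed per session")
        # A real run would execute the pipeline here.
        self._run_called = True
        return self.session_id


session = ToySession(session_id="2023-08-01T10.00.00.000Z")  # made-up id
first_id = session.run()

try:
    session.run()
    second_run_blocked = False
except ToySessionError:
    second_run_blocked = True

# The workaround from the text: flip the private flag and run again.
session._run_called = False
second_id = session.run()

print(second_run_blocked)     # the guard rejected the plain second run
print(first_id == second_id)  # the forced second run shares the first run's id
```

Anything keyed on session_id therefore sees the two runs as one, which is the uniqueness problem the check protects against.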

Experiment tracking is not a core feature of kedro (it belongs to kedro-viz). Is there any other obvious reason why we need to prevent a session_id from being run twice?

(edited)
It could be related to the timestamp used for saving versioned data. However, it's unclear to me, because the catalog gets save_version from session_id, but there is also another function that you can find in most dataset implementations:

save_version = self.resolve_save_version()

Possible Implementation

Source: #1551 (comment)

(Bonus) - KedroSession.reset() to create a new session easily? This could make the Jupyter workflow nicer: instead of asking users to create their session with lots of details, they can just take the global session and call session.reset() #1571

Maybe implement a session.clear() or session.reset() method.
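One way such a reset() could behave, sketched on a toy class. This is not Kedro's API: ToySession, new_session_id() and the counter suffix on the id are all assumptions made for illustration. The idea is that reset() issues a new id and clears the run flag while keeping the expensive context object.

```python
from datetime import datetime, timezone
from itertools import count

_sequence = count()


def new_session_id() -> str:
    """Hypothetical id: UTC timestamp plus a counter so rapid resets still differ."""
    stamp = datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    return f"{stamp}-{next(_sequence):04d}"


class ToySession:
    """Toy stand-in for a session; not the real KedroSession."""

    def __init__(self, context: dict) -> None:
        self.context = context  # expensive to build, kept across resets
        self.session_id = new_session_id()
        self._run_called = False

    def run(self) -> str:
        if self._run_called:
            raise RuntimeError("only one run should be executed per session")
        self._run_called = True
        return self.session_id

    def reset(self) -> None:
        """Hypothetical reset: new id, cleared flag, same context."""
        self.session_id = new_session_id()
        self._run_called = False


context = {"catalog": "already loaded"}
session = ToySession(context)
first_id = session.run()
session.reset()
second_id = session.run()

print(first_id != second_id)       # each run has its own id
print(session.context is context)  # the context object was not rebuilt
```

Compared with flipping _run_called by hand, this keeps the one-run-per-id guarantee intact.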

Possible Alternatives

  • Speed up reload_kedro so the overhead is insignificant.
  • Remove the session._run_called checks
@noklam
Contributor Author

noklam commented Sep 21, 2023

Muhammed Afnas
12:03 PM
hi everyone,
can we initiate multiple sessions in kedro? if yes, could anyone help me with it?
kedro version - 0.18
i am building a web application wherein i have to trigger the different pipelines of a kedro project based on button clicks on the dash ui.
as of now, individually it is working, but when one session is running, if i try to trigger another session it gives a runtime error.

@astrojuanlu
Member

Experiment tracking is not a core feature of kedro (it belongs to kedro-viz). Is there any other obvious reason why we need to prevent a session_id from being run twice?

I recall there's some issue about session_id that @datajoely identified in his research. Maybe it's related?

@noklam
Contributor Author

noklam commented Sep 21, 2023

That's more related to orchestration, and it requires a way to pass a unique identifier when the run is spread across multiple KedroSessions.

@datajoely
Contributor

session_id is used for versioning too, which is why it needs to be alphabetically sortable.

Arguably, if we kept a private session_id and exposed a parameterisable one, that would be sufficient.
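A rough sketch of that split, under stated assumptions: ToyRun is a hypothetical class, and VERSION_FORMAT is an assumed timestamp layout with fixed-width, zero-padded fields. The private id stays a sortable timestamp for versioning, while the exposed run_id can be supplied by the caller, for example by an orchestrator.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed format: fixed-width, zero-padded fields, so string order matches time order.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"


def timestamp_id(moment: datetime) -> str:
    return moment.strftime(VERSION_FORMAT)


@dataclass
class ToyRun:
    """Hypothetical run record with a private and a user-facing identifier."""

    started_at: datetime
    run_id: Optional[str] = None  # caller-supplied, need not be unique or sortable

    def __post_init__(self) -> None:
        # Private id used for versioning; always derived from the start time.
        self._session_id = timestamp_id(self.started_at)
        if self.run_id is None:
            self.run_id = self._session_id


base = datetime(2023, 9, 21, 12, 0, 0, tzinfo=timezone.utc)
runs = [
    ToyRun(started_at=base + timedelta(hours=2), run_id="orchestrator-42"),
    ToyRun(started_at=base),
    ToyRun(started_at=base + timedelta(hours=1), run_id="orchestrator-42"),
]

by_string = sorted(runs, key=lambda r: r._session_id)
by_time = sorted(runs, key=lambda r: r.started_at)

print([r._session_id for r in by_string] == [r._session_id for r in by_time])
print([r.run_id for r in by_string])
```

Two runs share the same orchestrator id here, yet versioning still orders them correctly because it only looks at the private timestamp.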

@astrojuanlu
Member

Uh, we're sorting by session_id? Maybe we should store the datetime instead, but this might be a bit of a digression.

@datajoely
Contributor

The session_id was the Versioning ID way back when - @merelcht @idanov can provide more context here

@astrojuanlu
Member

Moving this to the Session milestone.
