StateBackend in DataFusion's RuntimeEnv #11365

Open
ameyc opened this issue Jul 9, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@ameyc
Contributor

ameyc commented Jul 9, 2024

Is your feature request related to a problem or challenge?

Currently, DataFusion operators communicate through a narrow API: they forward SendableRecordBatchStreams to one another. In some cases, particularly ExecutionPlans that operate on unbounded streams, operators need to snapshot their state and coordinate with source operators. Adding a StateBackend concept to the RuntimeEnv would be a powerful primitive: users could then write operators that store ad hoc durable state in a backend such as RocksDB.

I realise this may not be useful for many use cases, but RuntimeEnv already seems able to plug in an object store registry as well as a catalog manager. This would be a crucial unlock for building stateful stream processing applications with DataFusion.

If the current API already contains such a pathway, I would love pointers in the right direction.
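
To make the request concrete, here is a minimal sketch of what such a backend could look like. Everything in it is an assumption for illustration: the StateBackend trait, its put/get/delete methods, InMemoryStateBackend, and CountingOperator are hypothetical names and are not part of DataFusion. A durable implementation (for example one backed by RocksDB) would implement the same trait.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical trait (not a DataFusion API): a namespaced key-value store
/// that operators could use to persist ad hoc state.
pub trait StateBackend: Send + Sync {
    fn put(&self, namespace: &str, key: &[u8], value: Vec<u8>);
    fn get(&self, namespace: &str, key: &[u8]) -> Option<Vec<u8>>;
    fn delete(&self, namespace: &str, key: &[u8]);
}

/// In-memory stand-in used only for this sketch; it is not durable.
#[derive(Default)]
pub struct InMemoryStateBackend {
    data: RwLock<HashMap<(String, Vec<u8>), Vec<u8>>>,
}

impl StateBackend for InMemoryStateBackend {
    fn put(&self, namespace: &str, key: &[u8], value: Vec<u8>) {
        let mut data = self.data.write().unwrap();
        data.insert((namespace.to_string(), key.to_vec()), value);
    }

    fn get(&self, namespace: &str, key: &[u8]) -> Option<Vec<u8>> {
        let data = self.data.read().unwrap();
        data.get(&(namespace.to_string(), key.to_vec())).cloned()
    }

    fn delete(&self, namespace: &str, key: &[u8]) {
        let mut data = self.data.write().unwrap();
        data.remove(&(namespace.to_string(), key.to_vec()));
    }
}

/// Hypothetical operator state: a running row count that can be
/// snapshotted to the backend and restored after a restart.
pub struct CountingOperator {
    backend: Arc<dyn StateBackend>,
    count: u64,
}

impl CountingOperator {
    const NAMESPACE: &'static str = "counting_operator";
    const KEY: &'static [u8] = b"count";

    /// Rebuild the operator from the last snapshot, or start at zero.
    pub fn restore(backend: Arc<dyn StateBackend>) -> Self {
        let count = backend
            .get(Self::NAMESPACE, Self::KEY)
            .and_then(|bytes| {
                if bytes.len() == 8 {
                    let mut arr = [0u8; 8];
                    arr.copy_from_slice(&bytes);
                    Some(u64::from_le_bytes(arr))
                } else {
                    None
                }
            })
            .unwrap_or(0);
        Self { backend, count }
    }

    /// Update in-memory state; nothing is persisted until `snapshot`.
    pub fn observe(&mut self, rows: u64) {
        self.count += rows;
    }

    /// Persist the current count to the backend.
    pub fn snapshot(&self) {
        self.backend
            .put(Self::NAMESPACE, Self::KEY, self.count.to_le_bytes().to_vec());
    }

    pub fn count(&self) -> u64 {
        self.count
    }
}
```

In this sketch, state observed after the last snapshot is lost on restore, which is the behaviour a checkpoint-based streaming operator would need to coordinate with its sources.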

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@ozankabak
Contributor

We actually do something like this in our use of DF for stream processing. Since it could remain unused or irrelevant outside stream processing, adding it to upstream DF may be an overfit. Let's see whether other use cases emerge; if they do, adding it could make sense.

@ameyc
Contributor Author

ameyc commented Jul 11, 2024

Makes sense, @ozankabak. Our thinking was that a StateBackend trait could be useful to more operators than just the streaming ones, especially operators that spill to disk. Is there a way to make the state backend part of the SessionState without this change, or by using an existing feature?

@emgeee
Contributor

emgeee commented Jul 29, 2024

Even aside from the specifics of StateBackend, it seems like RuntimeEnv could benefit from more flexibility.
Perhaps we could add extension functionality to the RuntimeEnv via a trait, allowing users to register custom structs that any custom operator that needs them could then downcast?
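
A minimal sketch of the register-then-downcast idea, using only the Rust standard library. The names RuntimeExtensions, register, get, and CheckpointConfig are hypothetical and chosen for illustration; this is not an existing RuntimeEnv API. The registry is keyed by TypeId, so each concrete type can be stored once and retrieved with its original type.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical extension registry (not a DataFusion type): stores one
/// shared value per concrete type, keyed by that type's TypeId.
#[derive(Default)]
pub struct RuntimeExtensions {
    entries: HashMap<TypeId, Arc<dyn Any + Send + Sync>>,
}

impl RuntimeExtensions {
    /// Register an extension, replacing any previous value of the same type.
    pub fn register<T: Any + Send + Sync>(&mut self, ext: Arc<T>) {
        self.entries.insert(TypeId::of::<T>(), ext);
    }

    /// Look up an extension by type and downcast it back to `Arc<T>`.
    /// Returns `None` if nothing of type `T` was registered.
    pub fn get<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        self.entries
            .get(&TypeId::of::<T>())
            .cloned()
            .and_then(|ext| ext.downcast::<T>().ok())
    }
}

/// Example of a custom struct a user might register (hypothetical).
pub struct CheckpointConfig {
    pub interval_ms: u64,
}
```

A custom operator holding a reference to such a registry could call `get::<CheckpointConfig>()` and fall back to a default when the extension is absent, so operators that do not need the extension are unaffected.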

3 participants