[DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357
Comments
It seems to me we also haven't documented anywhere that "the built in SQL dialect tries to follow postgresql semantics when possible"
Some ideas for potential criteria
Great to start the discussion of this. I have questions about this too. I think we don't really need to have all the functions in datafusion. I think core functions like |
@alamb I fully agree with your recommendation. It maintains the power of DataFusion while avoiding too much complexity. In my mind (and I think the project), DataFusion is first and foremost an extensible query engine, so that many new things can be implemented as a result. That purpose means the core features should be limited to those things that enable extensibility, rather than trying to bundle it all into DataFusion itself. |
thanks for starting this discussion @alamb. this was a very clearly missing clarification. let's get it discussed!
@cisaacson very good point!
This implies a reasonable collection of functions should be bundled to make the engine useful for the end-user. For people building things on top of DataFusion, core performance & extensibility are must-haves. If we want DataFusion to be "just" an extensible query engine, we can stop here. For users, functionality (broadly speaking) and out of the box experience are must-haves.
This is an obviously good call. In practice we will face trade-offs like first-row latency vs throughput, and the right choice depends on the intended typical use-case. Hopefully this is not too often.
I think maintaining only SQL features that are also built-in features in PostgreSQL is a good idea (and also a very clear criterion)
I think the reason DuckDB is also taken into consideration is that when we started on the array functions, we found that an OLAP-style database was a much more suitable model to follow than Postgres (#6855). Therefore, I'm not sure that sticking with Postgres only is a good idea. We might need to discuss it case by case.
I agree -- this is a good point. There are certain feature sets (
I think it would be an excellent idea to make sure we don't end up with some "hard and fast rule that must always be followed" -- ensuring we can continue to evaluate each idea on a case by case basis is a great point. Maybe in these cases the "bar" is higher (like a good amount of the community thinks it is an important and widely applicable feature)
I agree with keeping the DataFusion core simple and focused. I am wondering whether we should maintain an index service, or something like the VSCode marketplace, to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need. The index could display different properties depending on the extension type, such as TableProvider or UDF. We may need to do some work to make integrating extensions into DataFusion easier.
I like the analogy that DataFusion is to query engines what LLVM is to programming languages. (I think I heard Andrew say that once?) Although the analogy isn't perfect, because you can use DataFusion out of the box for a great SQL query experience whereas LLVM (to my knowledge) requires writing a non-trivial amount of code to integrate with it. Actually I think it's because DataFusion has such a great out of box experience that people naturally want to add to it to make it even better.
This is an important distinction, and where we need to decide if we want to be more like LLVM (i.e. focus on people building things on top of DataFusion) or something that attracts users directly. I don't think that those are mutually exclusive (i.e. most users probably are people building on top of DataFusion) - but I do think it makes sense to focus more on the core part of what makes DataFusion great as mentioned above.
Yes, I think part of the solution here is to make it very easy to discover extensions that add to the base DataFusion functionality. I think part of why it's tempting to add new features to DataFusion core is that it makes them more discoverable by default and provides a natural coordination point for implementing a set of functionality. As a concrete example, when I was first integrating with DataFusion for my project I needed the ability to translate DataFusion expressions back into raw SQL strings to implement a TableProvider. I found the |
I agree -- this idea is somewhat mentioned in https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz as well:
I agree with @phillipleblanc and @jonahgao -- here is a proposal to try and make it easier to discover extensions: It isn't quite as easy as VSCode marketplace (or the newly announced DuckDB community extensions: https://community-extensions.duckdb.org/) but it is a start. I also very much hope that the https://github.com/datafusion-contrib/datafusion-tui project @matthewmturner and I are working on will become an example / easy to start from place for pre-cooked integrations which will help with discoverability. We still have a ways to go but I am feeling bullish. |
Thank you for starting this discussion. I really agree with this concise statement:
When we first joined the project (almost two years ago now), it took us some time to internalize/digest this approach as our first instinct was to contribute as much as we can upstream. However, I can safely say that following this guideline helped us with our engineering too -- it forces one to think about the right boundaries between components, what belongs to the core, etc. |
One vote here for the other use case. I'd like datafusion to be usable as a single node query engine (alongside a nice dataframe api). This is in the works within the datafusion-python bindings, but I'd personally love for this use case to gain as much priority as datafusion as a library to build other db products on top of. I really think that with a combination of really strong python bindings (and ensuring that all extension points are also appropriately exposed to python), #4285, and a lot of work into making the docs and the python bindings as nice as polars, DataFusion could become the go-to solution for ETL/OLAP/ML/data engineering/etc. use cases. DataFusion has a lot of really excellent foundational engineering. How it's used by so many downstream DB engines attests strongly to that. I think it's a real shame that it isn't quite as suitable for the role that pandas/dask/polars/duckdb currently occupy. This isn't due to anything lacking in the query engine, but the overall user experience for a direct user isn't quite as solid (as opposed to someone using it as a library).
Thank you @kszlim -- This is well stated, and I think this is one of the core tensions that has existed in the project from the early days. One way to go is as you suggest and try and make datafusion the superset of all that is good about polars (python dataframes) and duckdb (sql). I worry that this will result in an even larger library that will never be as good as either. Another potential way is to keep the core focused on fundamentals and work to provide open source alternatives to those other libraries built on datafusion. It is my not-so-secret goal with the following discussions:
I am hoping to see datafusion-python (or maybe a library built on datafusion-python) and The benefit of keeping the core more focused is that it would make it easier to embed and have more usecases, thus drawing more users and thus contributors back.
I feel like we've made a ton of progress on this in datafusion-python 40 and 41. As someone who is also using datafusion-python in my project, I can already feel the huge usability improvements that make my day to day work more enjoyable. Now, I'm probably biased since I am focusing on building those as I need them for my projects. But the type hinting, simpler apis, html rendering in notebooks, and rust udfs in python all have made a really different experience from when I first started to use it. The point I'm still struggling with right now is the extension points and how those can/should fit into the python bindings. There are some parts that are trivially easy to do and some parts that are not supported. I should probably open an issue to find out what all of the extensions people would like to see in the python bindings. That's a bit of an aside from the central discussion here. My thoughts on the core question are much in line with what @alamb suggests above about supporting core features and a minimal set of extensions to demonstrate the usability.
I think all great software is created by someone who is in some way building it for themselves and has an intuitive understanding of what is needed. I am very glad you have started to help craft datafusion-python this way |
This is my use case - datafusion is an embedded query engine which I use via its dataframe api. I have a very small set of changes that I've made to datafusion in a branch but for the most part I use it as it is.
FYI we created https://github.com/datafusion-contrib/datafusion-functions-extra as a home for extra functions to try and organize our efforts to make new functions outside the core of datafusion. See #12254 (comment) for more details
In case it isn't obvious, one of my goals with encouraging / setting up other repositories is to provide an outlet for contributions that isn't the datafusion core. I don't want the answer to be "no we don't want them" -- I just think the answer can't be "put them in the datafusion core" for everything (mostly to keep the maintenance of the project manageable)
@alamb This is a great way to do this: it allows the core of DataFusion to keep its focus. I support this approach, and it allows other tools to be added that rely on DataFusion.
Unless anyone has further comments, I hope to make a PR codifying the discussion above into the documentation over the next week or two |
I don't know if this is the correct thread, and maybe I am just bad at searching - but I spent at least a few hours trying to figure out if it's possible to create & register custom DDL, for instance (just a silly example to get the point across) create TACO as t WITH toppings ( ... ); or perhaps something entirely different without the keyword In reality, it might be to register external secret managers (similar to duckdb's I suspect it might eventually be covered in the unfinished section here though https://datafusion.apache.org/library-user-guide/extending-operators.html, but I thought to ask either way here for good measure. |
Hi @mkarbo -- DataFusion actually has its own SQL dialect that was implemented as a small extension to the sqlparser (see https://docs.rs/datafusion/latest/datafusion/sql/parser/struct.DFParser.html). I think you can take a look at how DataFusion does it -- namely parse the token stream yourself (unless you need some token that is not defined in sqlparser-rs) and delegate to sqlparser-rs if it isn't your special DDL. Then you have a |
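The "handle your own DDL, delegate everything else" pattern described in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses nothing from DataFusion or sqlparser-rs, the `Statement` enum, the toy tokenizer, the `CREATE TACO <name> WITH TOPPINGS (...)` grammar, and `parse_with_default_parser` are all hypothetical stand-ins, and a real integration would work on sqlparser-rs tokens and hand unrecognized statements to the stock parser instead.

```rust
// Hypothetical sketch: none of these types come from DataFusion or sqlparser-rs.

/// Statements the wrapper parser can produce.
#[derive(Debug, PartialEq)]
enum Statement {
    /// Made-up custom DDL: `CREATE TACO <name> WITH TOPPINGS (<a>, <b>, ...)`.
    CreateTaco { name: String, toppings: Vec<String> },
    /// Anything else, handed unchanged to the stand-in for the stock SQL parser.
    Delegated(String),
}

/// Stand-in for the stock parser; a real integration would call the
/// underlying SQL parser here (assumption, not shown).
fn parse_with_default_parser(sql: &str) -> Result<Statement, String> {
    Ok(Statement::Delegated(sql.trim().to_string()))
}

/// Toy tokenizer: whitespace-separated words, with `(`, `)`, `,`, `;` as their own tokens.
fn tokenize(sql: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for ch in sql.chars() {
        if ch.is_whitespace() || matches!(ch, '(' | ')' | ',' | ';') {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        } else {
            current.push(ch);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// Consume one token and check that it matches `expected` (case-insensitive).
fn expect<'a>(tokens: &mut impl Iterator<Item = &'a str>, expected: &str) -> Result<(), String> {
    match tokens.next() {
        Some(tok) if tok.eq_ignore_ascii_case(expected) => Ok(()),
        other => Err(format!("expected {expected}, found {other:?}")),
    }
}

/// Parse the part of the custom statement that follows `CREATE TACO`.
fn parse_create_taco(rest: &[String]) -> Result<Statement, String> {
    let mut iter = rest.iter().map(String::as_str);
    let name = iter.next().ok_or("expected a name after CREATE TACO")?;
    expect(&mut iter, "WITH")?;
    expect(&mut iter, "TOPPINGS")?;
    expect(&mut iter, "(")?;
    let mut toppings = Vec::new();
    loop {
        match iter.next() {
            Some(")") => break,
            Some(",") => continue,
            Some(word) => toppings.push(word.to_string()),
            None => return Err("unterminated topping list".to_string()),
        }
    }
    Ok(Statement::CreateTaco { name: name.to_string(), toppings })
}

/// Entry point: peek at the leading tokens, handle the custom DDL ourselves,
/// and delegate every other statement to the default parser.
fn parse_statement(sql: &str) -> Result<Statement, String> {
    let tokens = tokenize(sql);
    let is_custom = tokens.len() >= 2
        && tokens[0].eq_ignore_ascii_case("CREATE")
        && tokens[1].eq_ignore_ascii_case("TACO");
    if is_custom {
        parse_create_taco(&tokens[2..])
    } else {
        parse_with_default_parser(sql)
    }
}

fn main() {
    println!("{:?}", parse_statement("CREATE TACO t WITH TOPPINGS (salsa, cheese);"));
    println!("{:?}", parse_statement("SELECT 1"));
}
```

The design point is that the wrapper only needs to recognize the leading keywords of its own statements; everything else stays the responsibility of the existing parser, so the custom syntax does not have to live in the core.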
Is your feature request related to a problem or challenge?
DataFusion is growing by almost all measures: community, features, and codebase size, which is good. However, this growth is causing challenges such as:
As described in the Design Goals, it is important for DataFusion to:
However, this description doesn't offer any specific criteria about which features should be in the core (to work "out of the box") and which should be implemented as extensions
I am worried that if we take all possibly useful features, the DataFusion core will become unmanageable / unmaintainable. Already we are struggling with review capacity (it takes days / weeks to review new feature PRs)
Describe the solution you'd like
I would like a clearly articulated set of criteria of when features should be added to the core vs when they should be in downstream projects / crates built with the extension APIs
Describe alternatives you've considered
No response
Additional context
No response