[Epic] Make DataFusion a reliable foundation for building query engines #12723

findepi · 2024-10-02T16:02:15Z

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. DataFusion’s target users are developers building fast and feature rich database and analytic systems, customized to particular workloads. See use cases for examples.

“Out of the box,” DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python Bindings are also available.

ibid.

DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

The two passages indicate dual nature of DataFusion

First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or dfdb proposed in #11979) or DataFusion's SQL and DataFrames frontends.

Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.

(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)

A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different.
DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.

A query engine and a query language have certain attributes

language: syntax
language: scope resolution rules
language: type system, including coercion rules
language: function repertoire
engine: data sources
engine: access control
engine: observability

Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.

This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.

On the high level, it aims at separation of concerns. The two roles DataFusion has -- an implementation of DataFusion SQL, a particular SQL dialect along and being reusable building block -- they should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components needed to be dialect-agnostic.

Goals and Overview

As a result DataFusion should have the following layers

Frontend

DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.

It provides the following functionality

SQL frontend, with the syntax supported being a subset of syntax supported by the sqlparser crate that DataFusion SQL uses
the type system (currently Arrow data types (DataType) but that should change in [Proposal] Decouple logical from physical types #11513)
a repertoire of builtin functions
coercion rules required to make the language useufl (i.e. the coercion rules that exists today)
performs function resolution (taking function repertoire, their signatures and coercion rules into account)
the frontend remains extensible e.g. by adding more functions or providing table sources.
it may support extending the type system (Extension Types #12644)

DataFusion main

DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines.
It serves as an API for all query engine implementors that decide to build on top of DataFusion.

It provides the following functionality

type system: a simplified version of Arrow DataType ([Proposal] Decouple logical from physical types #11513)
- it is currently assume that allowing extension types (Extension Types #12644), or totally different data types, is not necessary at this layer. The applications building on top of DataFusion will have to translate their type systems to that of DataFusion main
- it should support annotating the DataFusion types with additional attributes or traits, so that custom optimizer rules can reason about them, or DataFusion type system gets reacher
  - for example, for a language with a constrained varchar type, a correct implementation of UnwrapCastInComparison for input CAST(a_number AS varchar(4)) > '1234' should check whether a_number can safely be represented in 4 characters (would the cast fail?). it clearly can't fail for eg int8 type, and clearly can fail for int32 type
plan IR (intermediate representation) with strict typing rules and no implicit coercions (e.g. for a = b types of a and b match exactly)
does not perform type coercions
does not use function signatures (the formal signatures of functions inserted into the plan must match exactly)
does not have a function repertoire (doesn't need it, since function resolution already happened), but may provide implementations of functions that are reusable by the Frontend layer, or by applications building on top of DataFusion
has simple (stripped down) expression language. For example is doesn't have to discern between IS NULL and IS UNDEFINED or have a Wildcard expression, since those things are handled by the frontend layer.
optimizers which take the the IR plan and produce an equivalent better IR plan
since it doesn't do scope resolution, it doesn't need a concept of aliases, but needs to preserve output field names externally assigned (e.g. by the frontend layer)
- see also Support columns having the same alias #6543
- see also Consider introducing unique expression IDs in Logical/Physical plan #8379
- see also Remove Alias from Expr #1468
the DataFusion main remains extensible e.g. by adding custom optimizer rules, or custom plan nodes

DataFusion execution

It provides the following functionality

physical planning (ExecutionPlan, PhysicalExpr)
type system: a simplified version of Arrow DataType ([Proposal] Decouple logical from physical types #11513)
- using Arrow DataType directly is also an option, but would prevent runtime-adaptive execution, see Runtime-adaptive data representation #12720, so simplified types would be strongly preferred, [Proposal] Decouple logical from physical types #11513
has no coercion rules
executes queries and produces results

Tasks

(Tasks to be added here once they are discovered and defined.)

The text was updated successfully, but these errors were encountered:

findepi · 2024-10-02T16:02:24Z

cc @alamb, @andygrove, @jayzhan211, @ozankabak, @notfilippo, @comphead, @sadboy, @milevin, @wizardxz

alamb · 2024-10-02T16:09:44Z

This certainly sounds ambitious (but good!)

Another specific change towards this goal that might be worth considering (and that came in @andygrove 's CMU talk about comet) would be user defined coercion rules

The coerercion / type resolution is pretty specific today and can't be easily extended

findepi · 2024-10-02T16:16:59Z

Thanks @alamb for your feedback! For #12644 i started to prototype a new type system that DataFusion could adopt, with assumptions that all types should be first-class citizens (so a trait Type rather than a enum Type). This obviously leads to type providers having to negotiate the coercion rules, but also makes creating core optimizer rules harder -- it would mean the optimizer doesn't know actual types at all! I am pretty sure this is viable approach, but also pretty complicated. And it doesn't solve all the problems on its own, since at some point we need for "physicalize" the types into some common denominator. This remaining thing being exactly what this issue is about. Building a solid common denominator.

findepi · 2024-10-02T16:17:15Z

Let's put the exact mechanics (Tasks) aside, would be awesome to have agreement on the two things first:

do we want DataFusion to be a reliable building block for building query engines?
if yes, do we want to introduce an architecture supporting that goal?

alamb · 2024-10-02T16:55:21Z

do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

if yes, do we want to introduce an architecture supporting that goal?

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal, though it has areas that could be improved (e.g. etendibel coercion, user defined types, etc)

jayzhan211 · 2024-10-02T23:55:25Z

I think #12622 is important if we want a nice function signature and coercion rule.

findepi · 2024-10-03T07:52:27Z

Thank you @alamb @jayzhan211 for your comments!

do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

I do think this is the current goal.
However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal,

Not fully. For example, the DF coercion rules are anywhere. The function signatures are consulted at various phases of the planning. Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that.
It is probably easy to fix (conceptually) by defining the layers. The frontend layer (SQL parsing, analysis, creation of the initial plan) is responsible for function resolution and coercions. Beyond that layer we do not repeat same questions.

(e.g. etendibel coercion, user defined types, etc)

I think #12622 is important if we want a nice function signature and coercion rule.

yes to both! plus #12644

but even if we do all of that, we still run on top of Arrow for execution (which is not a bad thing), so we need a layer at which the (simplified) arrow types is the type system (simplified to allow #12720). The physical plans could be the only such layer, but it's a missed optimizations reuse opportunity, they are too physical already. Thus we need a logical plan layer where engines can integrate.

TL;DR
yes, i believe the goal here is inline with project goal.
let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

alamb · 2024-10-03T10:56:20Z

However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I think in my mind the goal is to be an extensible query engine and an LLVM for data intensive systems. Perhaps that is subtlety different 🤔

Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that.

Pedantically I would probably phrase this as "the current code doesn't allow that" -- I think that by adding a suitable API it would be straightforward to allow user defined coercion rules. There may be different opinions on if that would be an architectural change, but I would say it isn't.

alamb · 2024-10-03T11:00:01Z

let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

It seems to me like there is broad agreement that:

More flexible (aka user defined) coercion rules would be good.
Support extension types Extension Types #12644 is good
Finding a way to support runtime DataType adaptivity Runtime-adaptive data representation #12720 would be good

Maybe we can treat these as three different features

At least one good next step is probably file a ticket describing what "user defined coercion rules" would look like

findepi · 2024-10-04T12:01:58Z

I don't want this ticket to be about flexible (aka user defined) coercion rules.
I believe this is necessary ingredient for #12644, but can also be tracked separately. My point is that for things like Extension Types (#12644) we need a lower layer which is stripped of those extension types. I don't think it's just physical layer, it would be missed optimizations opportunities cost.
So, with this issue my goal is not to take away any functionality, or forbid adding any functionality. Quite contrary, to allow more functionality to be added, with less breaking changes for downstream consumers, we need some form of "contracts." for various APIs we provide.
Highest level layer should provide all the bells & whistles. Lowest layer is Arrow vectors and kernels. Let's incrementally define the middle layer, as it will make reasoning about the code easier.

I hope there is no disagreement that more code structure, simpler code flow and well defined contracts is a good thing.

findepi · 2024-10-22T10:29:06Z

#13028 change is a great supporting example for this initiative.
We have a language feature we want to expose in DataFusion (#9821), and due to how logical plan construction works today, this mostly-syntax-sugar feature affects logical plan structure, but it has no practical impact on logical plan expressibility, so from LP user perspective, it's a breaking change with no added value.
If we had clear frontend vs core separation -- as being proposed here -- downstream LP consumers would be not be affected by such changes.

findepi changed the title ~~Make DataFusion a reliable foundation for building query engines~~ [Epic] Make DataFusion a reliable foundation for building query engines Oct 2, 2024

findepi mentioned this issue Oct 9, 2024

[Proposal] Decouple logical from physical types #11513

Open

findepi mentioned this issue Oct 22, 2024

feat: support arbitrary expressions in LIMIT plan #13028

Merged

This was referenced Oct 22, 2024

Move TableConstraint to Constraints conversion #12953

Merged

Raise a plan error on union if column count is not the same between plans #13117

Merged

findepi mentioned this issue Nov 5, 2024

Use LogicalType for TypeSignature Numeric and String, Coercible #13240

Merged

alamb mentioned this issue Nov 22, 2024

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic] Make DataFusion a reliable foundation for building query engines #12723

[Epic] Make DataFusion a reliable foundation for building query engines #12723

findepi commented Oct 2, 2024 •

edited

Loading

findepi commented Oct 2, 2024

alamb commented Oct 2, 2024

findepi commented Oct 2, 2024

findepi commented Oct 2, 2024

alamb commented Oct 2, 2024

jayzhan211 commented Oct 2, 2024

findepi commented Oct 3, 2024

alamb commented Oct 3, 2024

alamb commented Oct 3, 2024

findepi commented Oct 4, 2024

findepi commented Oct 22, 2024

[Epic] Make DataFusion a reliable foundation for building query engines #12723

[Epic] Make DataFusion a reliable foundation for building query engines #12723

Comments

findepi commented Oct 2, 2024 • edited Loading

Goals and Overview

Frontend

DataFusion main

DataFusion execution

Tasks

findepi commented Oct 2, 2024

alamb commented Oct 2, 2024

findepi commented Oct 2, 2024

findepi commented Oct 2, 2024

alamb commented Oct 2, 2024

jayzhan211 commented Oct 2, 2024

findepi commented Oct 3, 2024

alamb commented Oct 3, 2024

alamb commented Oct 3, 2024

findepi commented Oct 4, 2024

findepi commented Oct 22, 2024

findepi commented Oct 2, 2024 •

edited

Loading