Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future of experimental optimizer datafusion-tokomak #440

Open
Dandandan opened this issue May 29, 2021 · 5 comments
Open

Future of experimental optimizer datafusion-tokomak #440

Dandandan opened this issue May 29, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@Dandandan
Copy link
Contributor

Dandandan commented May 29, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This issue is for discussing the future of datafusion-tokomak, an experimental optimizer using the egg library.

It currently allows to optimize Exprs and contains many optimizations currently not done in DataFusion.
I envision it could be extended to support a logical plan or physical plan too.

The optimizer using egg has the following nice properties, which are hard to achieve otherwise:

  • Development of new rules is really easy, most of them can be added in one line, or they could hook up to a trait.
  • Performs more aggressive optimizations than a handwritten optimizer, as it applies multiple rules at once, and can apply and remember rewrites that don't reduce the cost (such as reordering expressions).
  • Supports custom cost functions. It now uses the size of the AST, but could be easily changed to use something else.
  • Is fast for big programs / trees.
  • Is written in Rust, so integrates really well

Some material about it here https://egraphs-good.github.io/

Describe the solution you'd like
Some options:

Integrate it into DataFusion, as an optional feature

Add to DataFusion as separate crate

Keep it in separate repo as is, do some releases to crates.io in sync with DataFusion releases

Add as experimental repo / branch under the Apache organization

Describe alternatives you've considered
n/a

Additional context
Add any other context or screenshots about the feature request here.

@Dandandan Dandandan added the enhancement New feature or request label May 29, 2021
@jorgecarleitao
Copy link
Member

fwiw:

  • Thanks a lot for this!
  • I fully agree
  • Ship it as you see fit; let me know if you get blocked by any bureaucracy

@pjmore
Copy link
Contributor

pjmore commented Nov 6, 2021

Some good and bad news on this front. The Tokomak optimization pass combined with a predicate pushdown pass and a filter<-cross join to filter<-inner join pass is able to handle TPCH Q19 with each iteration taking ~66ms on my laptop. It performs the AstSize cost function transforms the expression into the form that @alamb mentioned in #217. Which gives the optimized logical plan:

Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)
  Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount)]]
    Projection: #lineitem.l_extendedprice, #lineitem.l_discount
      Filter: Boolean(true)
        Filter: #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND #part.p_size BETWEEN Int64(1) AND Int64(10) AND #part.p_brand = Utf8("Brand#23") AND #lineitem.l_quantity BETWEEN Int64(10) AND Int64(20) OR #part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #part.p_brand = Utf8("Brand#34") AND #part.p_size BETWEEN Int64(1) AND Int64(15) AND #lineitem.l_quantity BETWEEN Int64(20) AND Int64(30) OR #lineitem.l_quantity BETWEEN Int64(1) AND Int64(11) AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND #part.p_brand = Utf8("Brand#12") AND #part.p_size BETWEEN Int64(1) AND Int64(5)
          Join: #lineitem.l_partkey = #part.p_partkey
            Filter: #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
              TableScan: lineitem projection=Some([1, 4, 5, 6, 13, 14])
            TableScan: part projection=Some([0, 3, 5, 6])

Bad news is that I had to use a dev branch of egg due to Send requirements so merging the Tokomak optimizer may have to wait until the next version of egg releases. On the bright side there is plenty of work that needs to be done before I feel the optimizer is ready anyways:

  • Allow users to create their own rules that will run alongside the existing rules.
  • Allow users to create custom cost functions.
  • Allow users to run optimizer with a custom rewrite scheduler.
  • Extend the optimizer to allow it to operate on plans as well as expressions.
  • Add the Case expr, this particular expression does not mesh well with current rewrite syntax.
  • Create DSL which is better suited for rewriting expressions which involve lists and expressions that have to be of a particular type. The current rule syntax falls short when rewriting to or from lists, there is no real way to implement the rule
    col1 = 'one' or col1 = 'two' or col2='three' => col1 InList('one','two','three')
    An example of the need for typed expressions would be when rewriting a filter followed by a cross join into a inner join, as the optimization below is only valid when the :
Filter: #lineitem.l_partkey = #part.p_partkey
    CrossJoin: 
        TableScan: part
        TableScan: lineitem
-> 
Join:  #lineitem.l_partkey = #part.p_partkey
    TableScan: part
    TableScan: lineitem

as the in this case both the left and the right of the filter predicate must be columns for this to be a valid transform.
Another capability that would be nice to have would be to bind to operators in the same way as expression so instead of the rules for rotating the expression tree:

rw!("rotate-add"; "(+ ?x (+ ?y ?z))"=>"(+ (+ ?x ?y) ?z)"),
rw!("rotate-mul"; "(* ?x (* ?y ?z))"=>"(* (* ?x ?y) ?z)"),            
rw!("rotate-and"; "(and ?a (and ?b ?c))"=>"(and (and ?a ?b) ?c)"),
rw!("rotate-or"; "(or ?a (or ?b ?c))"=>"(or (or ?a ?b) ?c)"),

Could instead write something along the lines of:

rw!("rotate-expr-homogenous"; "(?op:(+|*|and|or) ?x (?op ?y ?z))" =>"(?op (?op ?x ?y) ?z)")
  • Determine whether combining expression and plan optimization at the same time is viable. Would probably require custom scheduler. Would complicate writing cost function but it might allow better optimizations to be performed.

I'm going to keep working on this in a separate repo for now because it's a long way from being what I would consider being complete.

@alamb
Copy link
Contributor

alamb commented Nov 6, 2021

Thank you for the update @pjmore -- sounds like some great progress

@Dandandan
Copy link
Contributor Author

Thanks for this @pjmore . Great milestone to have the new query working. Also wondering if some other cross joins are removed too :).
I think at some point it makes sense to "just add" the code to the repo so it becomes easier to continue experimenting. Feel free when you believe it's time to do so :).

@houqp
Copy link
Member

houqp commented Nov 14, 2021

Thanks @pjmore . The DSL looks pretty great, I was thinking about this while reviewing your initial PR the other day. Perhaps something you can upstream to egg as well?

I also agree with @Dandandan that we can merge your code into the main repo early without having to wait for it to be feature complete. Can't wait for this cool feature to land in master :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants