Update explain plan to show when topk operator is used #7750

alamb · 2023-10-05T20:29:20Z

Is your feature request related to a problem or challenge?

After #7721, a SortExec with a limit will use a special TopK operator.

However, I don't think it is clear from the EXPLAIN plan that this will be used (you have to know that sort with a limit is specially optimized to not actually sort).
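To make "optimized to not actually sort" concrete, here is a minimal sketch of the general top-k idea, assuming a bounded heap that keeps only the `k` largest values seen so far. It is not the implementation from #7721; the function name `top_k_desc` and the use of plain `i64` values are assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper (not DataFusion code): returns the `k` largest values
/// in descending order without sorting the whole input.
fn top_k_desc(values: impl IntoIterator<Item = i64>, k: usize) -> Vec<i64> {
    if k == 0 {
        return Vec::new();
    }
    // `Reverse` turns the max-heap into a min-heap, so the smallest of the
    // values currently kept is always at the top and cheap to evict.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    for v in values {
        if heap.len() < k {
            heap.push(Reverse(v));
        } else if heap.peek().map_or(false, |smallest| v > smallest.0) {
            // The new value beats the smallest kept value: replace it.
            heap.pop();
            heap.push(Reverse(v));
        }
    }
    // Ascending order of `Reverse<i64>` is descending order of the values.
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
```

With this approach only `k` values are held in memory at a time, which is the kind of saving an `ORDER BY ... LIMIT` query can benefit from compared with sorting every row.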

Using DataFusion CLI and the dataset described here: #7721 (review), the explain plan looks like this:

❯ explain select trace_id from 'traces.parquet' order by time desc limit 10;
+---------------+-------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                              |
+---------------+-------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: traces.parquet.trace_id                                                                               |
|               |   Limit: skip=0, fetch=10                                                                                         |
|               |     Sort: traces.parquet.time DESC NULLS FIRST, fetch=10                                                          |
|               |       Projection: traces.parquet.trace_id, traces.parquet.time                                                    |
|               |         TableScan: traces.parquet projection=[time, trace_id]                                                     |
| physical_plan | ProjectionExec: expr=[trace_id@0 as trace_id]                                                                     |
|               |   GlobalLimitExec: skip=0, fetch=10                                                                               |
|               |     SortExec: fetch=10, expr=[time@1 DESC]         <----- ****  this uses the TopK operator |
|               |       ProjectionExec: expr=[trace_id@1 as trace_id, time@0 as time]                                               |
|               |         ParquetExec: file_groups={1 group: [[traces.parquet]]}, projection=[time, trace_id] |
|               |                                                                                                                   |
+---------------+-------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

Describe the solution you'd like

I would like to make it clearer in the explain plan that this new operator is used. Perhaps something like:

SortExec: TopK(fetch=10), expr=[time@1 DESC]         <----- **** This is updated |

See the comment in https://github.com/apache/arrow-datafusion/pull/7721/files#r1343969678 for the exact location of the code that controls this output.
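As a rough illustration of the proposed output, the sketch below formats the operator line differently depending on whether a limit is present. `SortDisplay` is a hypothetical stand-in type, not the actual `SortExec` code in sort.rs; the `", "` separator between expressions and the no-limit format `SortExec: expr=[...]` are assumptions.

```rust
use std::fmt;

/// Hypothetical stand-in for the parts of `SortExec` that affect its
/// one-line EXPLAIN rendering. This is not DataFusion's actual type.
struct SortDisplay {
    /// Pre-rendered sort expressions, e.g. "time@1 DESC".
    exprs: Vec<String>,
    /// `Some(n)` when the sort carries a limit and the TopK path applies.
    fetch: Option<usize>,
}

impl fmt::Display for SortDisplay {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumption: expressions are joined with ", ".
        let exprs = self.exprs.join(", ");
        match self.fetch {
            // With a limit, name the TopK operator explicitly.
            Some(fetch) => write!(f, "SortExec: TopK(fetch={fetch}), expr=[{exprs}]"),
            // Assumption: without a limit, no fetch information is printed.
            None => write!(f, "SortExec: expr=[{exprs}]"),
        }
    }
}
```

Calling `to_string()` on a value with `fetch: Some(10)` and the single expression `time@1 DESC` yields the line proposed above, `SortExec: TopK(fetch=10), expr=[time@1 DESC]`.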

Describe alternatives you've considered

We can leave the explain plans alone, but I think that is confusing.

Additional context

No response


alamb · 2023-10-05T20:30:00Z

I think this is an excellent task for new users as the code is quite straightforward and most of this work will be to learn how to run the various tests and update them for the new explain plans.

Solves apache#7750: replaced `SortExec: fetch={fetch}, expr=[{}]` with `SortExec: TopK(fetch={fetch}), expr=[{}]` in the [sort.rs](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs) file

fansehep · 2023-10-10T09:12:58Z

Let me try it. 😄

alamb · 2023-10-10T10:02:40Z

Note that there is already a partial version in #7751, with a good comment here: #7751 (comment).

Maybe you can base your efforts on that PR.

MayurShirodkarOfficial · 2023-10-12T08:09:45Z

hello !! can i try working on this issue?

alamb · 2023-10-13T15:05:21Z

> hello !! can i try working on this issue?

Absolutely -- I think we are just waiting on someone to make a PR that passes CI :)

* Updated sort.rs. Solves #7750: replaced `SortExec: fetch={fetch}, expr=[{}]` with `SortExec: TopK(fetch={fetch}), expr=[{}]` in the [sort.rs](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs) file
* fix: ci

Co-authored-by: Pratibhanu Jarngal

alamb added the enhancement New feature or request label Oct 5, 2023

alamb added the good first issue Good for newcomers label Oct 5, 2023

alamb mentioned this issue Oct 5, 2023

Optimize "ORDER BY + LIMIT" queries for speed / memory with special TopK operator #7721

Merged

This was referenced Oct 6, 2023

Updated sort.rs to show TopK #7751

Closed

Updated sort.rs Night-Amber3301/arrow-datafusion#1

Merged

Updated sort.rs to show TopK #7763

Closed

haohuaijin mentioned this issue Oct 15, 2023

Update explain plan to show TopK operator #7826

Merged

alamb closed this as completed in #7826 Oct 15, 2023
