Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MicroPlans - Allow choosing between different plans on the split level #13534

Open
assaf2 opened this issue Aug 7, 2022 · 4 comments
Open

MicroPlans - Allow choosing between different plans on the split level #13534

assaf2 opened this issue Aug 7, 2022 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@assaf2
Copy link
Member

assaf2 commented Aug 7, 2022

There are certain connectors that can't always make pushdown guarantees on the coordinator level. For example, ORC files may contain headers describing the min and max values within a certain row group. We want to give Hive and Iceberg connectors the ability to tell the engine for each split if a certain filter is supported. For example, let's assume the engine pushes down the filter col > 1, Hive\Iceberg could respond with 2 plans - one that consumes the filter and another that doesn't. Then, for each split in which the filter is always true (in our example, when the min value is greater than 1), Hive\Iceberg would use the plan that consumes the filter.
This approach is a step directing to exploratory optimizer.

In general, we want to give connectors the ability to choose between different plans on the split level.
A new abstraction will be introduced - MicroPlanHandle (temporary name) which is a handle to a transformation of a table.
For certain metadata APIs, connectors will have the ability to return several MicroPlanHandles.
These APIs will only be those that affect the plan on the stage level (for example, they can’t affect any Exchange operation).
The optimization process will run in 2 phases. The first phase is the current optimization process. The second phase is where MicroPlanHandles are taken into consideration. At any given time, the connector won’t have the knowledge of which phase is running.

Changes in ConnectorMetadata

The signatures of the following APIs will be deprecated (and eventually removed) and replaced with:

Optional<List<(MicroPlanHandle, ResidualFilter)>> applyFilter(TableHandle, Optional<MicroPlanHandle>, Filter)
Optional<List<(MicroPlanHandle, ResidualProjection)>> applyProjection(TableHandle, Optional<MicroPlanHandle>, Projection)
Optional<List<MicroPlanHandle>> applyAggregation(TableHandle, Optional<MicroPlanHandle>, AggregationFunction)

In the first phase, the engine won’t pass a MicroPlanHandle and will only take the first element of the returned list. Trino might create a new table handle using a new API combine(TableHandle, MicroPlanHandle) -> TableHandle. Another option would be to embed the MicroPlanHandle inside the TableHandle.
In the second phase, all the elements in the returned list will be taken. The engine might put a limitation on the amount of MicroPlanHandles a connector can generate by pruning the last MicroPlanHandles off the list.
Therefore, the connector should place the broadest required residual element as the first element and then all the other elements ordered by priority.

Worker SPI

A new argument will be passed into ConnectorPageSourceProvider#createPageSource - Optional<List<Pair<MicroPlanHandle, List<ColumnHandle>>> instead of the existing List<ColumnHandle> argument.

ConnectorPageSource will contain a new method: Optional<Integer> getChosenMicroPlan(). The connector will return the ordinal number of the MicroPlanHandle it has chosen.

Plans that are not used by any split won’t be compiled and cached in the worker (lazy approach).

@assaf2 assaf2 added the enhancement New feature or request label Aug 7, 2022
@findepi
Copy link
Member

findepi commented Aug 8, 2022

@assaf2 can you please add an example describing what a micro plan is?

@assaf2
Copy link
Member Author

assaf2 commented Aug 10, 2022

@findepi it would be an empty interface like ConnectorTableHandle. Connectors will use it to store information about different ways to transform tableHandles. It may contain information about projected columns, filters, aggregations, etc.
For example, here is how push predicate will work:
Assuming the query SELECT * FROM t WHERE partition_col = 1 AND col1 = 2 AND col2 = 3 and that connector X knows for sure that it can handle predicates on partition_col, that it may or may not be able to handle predicates on col1 and that it can't handle predicates on col2.
First optimizer phase -

  • call applyFilter(TableHandle1, Filter:[partition_col=1, col1=2, col2=3]) => connector returns:
    ** MicroPlanHandle1={Filter: [partition_col=1]}, ResidualFilter=[col1=2, col2=3]
    ** MicroPlanHandle2={Filter: [partition_col=1, col1=2]}, ResidualFilter=[col2=3]
  • call combine(TableHandle1, MicroPlanHandle1) => connector returns TableHandle2 {Filter: [partition_col=1]}
  • call applyFilter(TableHandle2, Filter:[col1=2, col2=3]) => connector returns Optional.empty()

Second optimizer phase -

  • call applyFilter(TableHandle2, Filter:[col1=2, col2=3]) => connector returns:
    ** MicroPlanHandle3={Filter: []}, ResidualFilter=[col1=2, col2=3]
    ** MicroPlanHandle4={Filter: [col1=2]}, ResidualFilter=[col2=3]
  • call applyFilter(TableHandle2, MicroPlanHandle3, Filter:[col1=2, col2=3]) => connector returns Optional.empty()
  • call applyFilter(TableHandle2, MicroPlanHandle4, Filter:[col2=3]) => connector returns Optional.empty()

Then, for each split, the connector will use TableHandle2 and choose between MicroPlanHandle3 and MicroPlanHandle4.

@martint do you wish to add something?

@findepi
Copy link
Member

findepi commented Aug 11, 2022

Why does it need to be a new abstraction if all it does is to pass it along with table handle?
Isn't that this information could be just a connector-specific field within the connector's table handle?

@assaf2
Copy link
Member Author

assaf2 commented Aug 11, 2022

Because each MicroPlan will have different plan on the engine (that's why the engine will need to be aware of which MicroPlan was chosen for each split).
The point is to give more flexibility to connectors that can't make guarantees for the entire table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests

2 participants