
Add support for compacting small files for Hive tables #9398

Closed
wants to merge 42 commits

Conversation


@losipiuk losipiuk commented Sep 27, 2021

POC PR: High level review comments. No nit-picking at this phase please.

The PR adds support for the ALTER TABLE ... EXECUTE syntax.

On top of that, it adds support for compacting small files for non-transactional, non-bucketed Hive tables.

ALTER TABLE xxxxx EXECUTE compact_small_files WITH(file_size_threshold = ...)

Fixes #9466


@findepi findepi left a comment


Some initial thoughts.

Sorry for a bunch of low-level comments too.

}
}

Scope tableScope = analyze(table, scope);
Member

Is this equivalent to visitTable(table, scope)?
This seems to re-resolve the table again. Can we reuse the tableScope (relation type) creation without doing that?

Member Author

Yeah. I think we could if we recorded in the scope what type of relation was analyzed (view/materialized view/table). But it does not seem we are recording that information.

Member

We record that in io.trino.sql.analyzer.Analysis#registerTable

Would be good to reuse visitTable logic, since there is a lot going on here: tables, views, materialized views, redirections. Masks and filters -- we could pull them from Analysis too.

}
node.getWhere().ifPresent(where -> analyzeWhere(node, tableScope, where));

// analyze ORDER BY
Member

ORDER BY is nice because it allows us to tap into distributed sort when rewriting data.
However, a plain ORDER BY may be too limiting:

  • total ordering may not be required; sorting subsets of data can be sufficient (per file, per a certain amount of data, grouped execution)
  • different ordering schemes (e.g. z-order) may come in useful; we should think about how we will model them (an expression?)

@losipiuk losipiuk Sep 27, 2021

It is not used so far. And for local ordering (e.g. for Z-ordering) we can express the intention via WITH parameters,
e.g.

WITH (z_order_columns = ARRAY['a', 'b'])

Given that, maybe we should drop support for ORDER BY for now?

@alexjo2144 alexjo2144 left a comment

Slowly making my way through the commits, but had a question. Have you thought about how these procedures will interact with Access Control?

// TODO maybe refactor AbstractPropertyManager and use as a base so there is less code copied
public class TableProceduresPropertyManager
{
private final ConcurrentMap<Key, Map<String, PropertyMetadata<?>>> connectorProperties = new ConcurrentHashMap<>();
Member

Why does this need to be Concurrent?

Member Author

It is modelled after AbstractPropertyManager. I think theoretically the map can be modified by multiple threads in parallel as connectors are registered/unregistered. Not sure if that is really the case now.

Member

Not sure if that is really the case now.

It's currently not: the connectors are loaded serially during server startup, but it's conceivable that this becomes parallel.


losipiuk commented Oct 4, 2021

Slowly making my way through the commits, but had a question. Have you thought about how these procedures will interact with Access Control?

Great question. I left it for later and totally forgot about that. I guess the most straightforward approach would be to add checkCanExecuteTableProcedure(SecurityContext context, QualifiedObjectName tableName, QualifiedObjectName procedureName) to AccessControl. I will try to add that in.

Added "Add access control for table procedures" commit.

losipiuk commented Oct 4, 2021

@findepi, @electrum, @martint it would be great to get a review from one of you guys on this one :)

@findepi findepi left a comment

up to "Add support for table procedures SPI calls to Metadata"


import static java.util.Locale.ENGLISH;
import static java.util.Objects.requireNonNull;

// TODO maybe refactor AbstractPropertyManager and use as a base so there is less code copied
Member

Seems like what you need is to make AbstractPropertyManager generic in its key type, AbstractPropertyManager<K>, and add a public method in each subclass that converts the API-level arguments into the internal key.
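A simplified, standalone illustration of that shape (hypothetical; the real classes use PropertyMetadata, live in io.trino.metadata, and differ in detail):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Base class is generic in the key type; property metadata is reduced to plain strings here.
abstract class AbstractPropertyManager<K>
{
    private final ConcurrentMap<K, Map<String, String>> connectorProperties = new ConcurrentHashMap<>();

    protected final void addProperties(K key, Map<String, String> properties)
    {
        connectorProperties.put(key, Map.copyOf(properties));
    }

    protected final Map<String, String> getProperties(K key)
    {
        return connectorProperties.getOrDefault(key, Map.of());
    }
}

// The subclass exposes a public method converting API-level arguments into its internal key.
final class TableProceduresPropertyManager
        extends AbstractPropertyManager<TableProceduresPropertyManager.Key>
{
    record Key(String catalogName, String procedureName) {}

    public Map<String, String> getProcedureProperties(String catalogName, String procedureName)
    {
        return getProperties(new Key(catalogName, procedureName));
    }
}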

Member Author

Yeah. The usage would be somewhat less nice, but it will work. Do you want me to update the PR towards that?

return properties.build();
}

private Object evaluatePropertyValue(
Member

This could be static and shared with AbstractPropertyManager.

@findepi findepi left a comment

"Add parser/analyzer support for ALTER TABLE ... EXECUTE"


@Override
protected Void visitTableExecute(TableExecute node, Integer indent)
{
builder.append("ALTER TABLE ");
Member

If you ignore indent (since it must be 0 here), add checkArgument(indent == 0, "...").

Same for the other, preexisting visitors (separate PR).
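Roughly along these lines, sketched against the snippet above (an illustrative fragment, not the PR code; checkArgument is Guava's Preconditions.checkArgument):

@Override
protected Void visitTableExecute(TableExecute node, Integer indent)
{
    // Fail loudly if the "indent is always 0 here" assumption ever stops holding.
    checkArgument(indent == 0, "unexpected indent: %s", indent);
    builder.append("ALTER TABLE ");
    // ... rest of the formatting ...
    return null;
}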

Member Author

I think this is true now, but it is not obvious to me that it must always be true.
E.g. EXPLAIN ALTER ... does not bump the indent, but it could do so.

@findepi findepi left a comment

"Add access control for table procedures"

losipiuk commented Oct 5, 2021

I sent out the first batch of fixups. Let me know if you prefer to keep it this way for a while, or should I squash those in?

@losipiuk losipiuk force-pushed the lo/distributed-dml-1 branch 3 times, most recently from eae2195 to ec74a7f on October 5, 2021 16:02
// todo: would be great to call getTableExecuteSplitsInfo if we are executing plan which requires that.
TableExecuteContext tableExecuteContext = tableExecuteContextManager.getTableExecuteContextForQuery(stage.getStageId().getQueryId());
Optional<List<Object>> tableExecuteSplitsInfo = splitSource.getTableExecuteSplitsInfo();
tableExecuteSplitsInfo.ifPresent(tableExecuteContext::setSplitsInfo);
Member Author

Doing it here seems wrong, as it looks like we can have multiple SourcePartitionedSchedulers per single query, e.g. they can be created via FixedSourcePartitionedScheduler. Needs a rework.
cc: @findepi

@losipiuk losipiuk force-pushed the lo/distributed-dml-1 branch 2 times, most recently from 39c46e2 to dd93cc4 on October 7, 2021 19:36

@findepi findepi left a comment

"fixups"

@@ -1034,6 +1034,8 @@ protected Scope visitTableExecute(TableExecute node, Optional<Scope> scope)
analysis.getParameters());
analysis.setTableExecuteProperties(tableProperties);

analysis.setUpdateType("EXECUTE", tableName, Optional.empty(), Optional.empty());
Member

Is this used for https://trino.io/docs/current/sql/execute.html ?
Can the two be mistaken for each other?

Member Author

I do not believe the EXECUTE you mentioned classifies as an update statement, and in the code we never pass "EXECUTE" to io.trino.execution.QueryStateMachine#setUpdateType for it.
I can put something else here (not sure how much it matters), but EXECUTE matches what we do for other ALTER ... statements, e.g. for ALTER TABLE ... ADD COLUMN we put ADD COLUMN as the update type.

@findepi findepi left a comment

"Add planner logic for TableExecute statement"

LGTM; however, you may want to consult this particular commit with @kasiafi too.

Comment on lines +887 to +841
// TODO support broader range.
throw new TrinoException(NOT_SUPPORTED, "only predicates expressible as TupleDomain can be used with ALTER TABLE ... EXECUTE");
Member

I am unsure about it. First, this is going to change (e.g. #7994), but it's OK to support less. Correctness first.
However, expressibility with a TupleDomain doesn't guarantee the predicate will be subsumed. For example, the Hive connector doesn't consume a TupleDomain over non-partition columns.
Thus it feels the check here, at this stage, doesn't guarantee anything and we need a different check in a different place. So I hope we can remove the check here.

@losipiuk losipiuk Oct 8, 2021

Yeah, that is a fair point: even with the check here we depend on the connector to actually consume the predicate instead of ignoring it, and if it cannot for some reason it should throw an exception.
To be honest, I am not sure we can provide any engine-side validation that the connector behaves up to the contract.

Member

If you kept the predicate in the form of an Expression through the Optimizer phase, some optimizations could be applied and transform an unsupported predicate into a supported one.

@losipiuk losipiuk Oct 11, 2021

Good point. Though then I would need to structure the SPI very differently: getTableHandleForExecute would not take the constraint parameter and we would depend on applyFilter to do the job. Yet then we would need some mechanism (a validation optimizer rule?) to verify that, in the end, there is no filter left between the TableScanNode and the TableWriterNode when we are executing ALTER TABLE EXECUTE.

It feels doable, though more complex, and I am not sure we would get a real benefit, given that the condition passed in the WHERE clause will most probably be simple (a conjunction of range predicates?).

@kasiafi do you feel strongly about that?

Member

we would depend on applyFilter to do the job. Yet then we would need to have some mechanism (validation optimizer rule?) to verify that at the end there is no filter

If I understand correctly, we still need some validation that the whole constraint is consumed? Maybe it would be good to use the existing mechanisms for pushing the predicate / handling the non-accepted part?

However, if the Constraint is built here, I was wondering whether this validation could go into the Analyzer.

Member Author

If I understand correctly, we still need some validation that the whole constraint is consumed

We cannot really validate that. The contract is that the connector should throw an exception if the constraint cannot be consumed fully. We can discuss whether that contract is nice when it comes to the SPI shape, but we cannot do anything if the connector does not obey it (does not consume, and does not throw). If I change the approach to depend on applyFilter we still cannot do any validation: if the connector does not consume the predicate, yet returns an empty remainingFilter in ConstraintApplicationResult, the engine will not know.

Maybe it would be good to use existing mechanisms for pushing predicate / handling the non-accepted part

It feels to me it would be a bit nicer, and (as you said) it would handle more predicate shapes. On the other hand, the logic of a single procedure would be even more spread around the codebase, and harder to follow. I would start with the proposed approach and refactor as a follow-up if we decide it is worth it.

However, if the Constraint is built here, I was thinking if this validation could go to the Analyzer

Not sure I fully understand what you suggest here. Let's chat on Slack.

assignments.build(),
false,
Optional.empty());
RelationPlan tableScanPlan = new RelationPlan(
Member

Could you use RelationPlanner to process the table and get the RelationPlan? The above mostly duplicates the code of RelationPlanner.visitTable().

Member Author

It is not that straightforward. visitTable in RelationPlanner would take the TableHandle to be used for the TableScanNode from the analysis, and that is not what I want.
I want to use the TableHandle from TableExecuteHandle.sourceTableHandle. Any suggestions on how I should proceed?

  • I guess I can modify the analysis object I have before calling out to RelationPlanner (or rather create a new, modified one based on the one we got in the call to LogicalPlanner.planStatement).
  • The other option would be to create a public helper RelationPlanner.planTableWithHandle(Table table, TableHandle handle) and make it share code with RelationPlanner.visitTable.

Leaning towards the latter. WDYT?

Member Author

Oh ... current code is actually messed up.
I am using TableExecuteHandle.sourceTableHandle in the TableScanOperator but I am using ColumnHandles taken from analysis; and the two may not be compatible.

        for (Field field : scope.getRelationType().getAllFields()) {
            Symbol symbol = symbolAllocator.newSymbol(field);

            outputSymbolsBuilder.add(symbol);
            assignments.put(symbol, analysis.getColumn(field));
        }

I need to wrap my head around it :/

@losipiuk losipiuk Oct 11, 2021

Oh ... current code is actually messed up.

Or maybe that is not a problem. We already have pieces of code where we change the TableHandle in the TableScanNode but still use the old ColumnHandles (e.g. after applyFilter).

Another question: is it fine to assume here that the order of symbols in the plan I got from planning the table scan for Table matches the order of ColumnHandles I got from ConnectorTableMetadata?


@findepi findepi left a comment

"Pass splits info to TableFinish operator for ALTER TABLE EXECUTE"

*/
-    default void finishTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle, Collection<Slice> fragments)
+    default void finishTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle, Collection<Slice> fragments, List<Object> tableExecuteState)
Member

Since io.trino.spi.connector.ConnectorSplitSource#getTableExecuteSplitsInfo returns an Optional, I think this should be Optional<List> too.

Member Author

The current contract is that for the TableExecute flow ConnectorSplitSource must return a non-empty Optional here. Hence, starting from TableExecuteContext, we do not have any Optionals. I think this is simpler this way.
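For context, a rough sketch (an assumption about the shape, not the PR code) of the non-Optional hand-off described above, where TableExecuteContext simply refuses to hand out splits info that was never set:

import static com.google.common.base.Preconditions.checkState;
import static java.util.Objects.requireNonNull;

import java.util.List;

public class TableExecuteContext
{
    private List<Object> splitsInfo;

    public synchronized void setSplitsInfo(List<Object> splitsInfo)
    {
        checkState(this.splitsInfo == null, "splitsInfo already set");
        this.splitsInfo = requireNonNull(splitsInfo, "splitsInfo is null");
    }

    public synchronized List<Object> getSplitsInfo()
    {
        checkState(splitsInfo != null, "splitsInfo not set");
        return splitsInfo;
    }
}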

@findepi findepi left a comment

"Add tableExecuteSplitsInfo to FixedSplitSource"

Add support for compacting small files for non-transactional,
non-bucketed Hive tables.

ALTER TABLE xxxxx EXECUTE OPTIMIZE WITH(file_size_threshold = ...)

if (!someDeleted) {
    throw new TrinoException(HIVE_FILESYSTEM_ERROR, "Error while deleting ", e);
}
log.error(e, "Error while deleting data files in FINISH phase of OPTIMIZE; remaining files need to be deleted manually: " + tableExecuteState);
Member

Throwing is bad, and not throwing isn't great either; this is irrecoverable by us and requires user intervention.

I think throwing is a better idea than logging, as it at least ensures the problem is surfaced to the person invoking the procedure (query).

Member Author

Yeah - I very much would prefer to throw. Let me see. Maybe we can throw and still skip cleanup.
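For illustration, a hypothetical fragment (not the PR code; filesToDelete and fileSystem are assumed variables) of the "attempt everything, then throw" approach discussed here, which still lists the leftover files for manual cleanup:

List<String> remainingFiles = new ArrayList<>();
for (String path : filesToDelete) {
    try {
        fileSystem.delete(new Path(path), false);
    }
    catch (IOException e) {
        remainingFiles.add(path);
    }
}
if (!remainingFiles.isEmpty()) {
    // Surface the problem to the caller instead of only logging it; the files that
    // could not be removed have to be deleted manually.
    throw new TrinoException(HIVE_FILESYSTEM_ERROR,
            "Error while deleting data files in FINISH phase of OPTIMIZE; remaining files need to be deleted manually: " + remainingFiles);
}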

Member Author

PTAL now (last 2 commits)

@losipiuk

Replaced with: #9665

@losipiuk losipiuk closed this Oct 19, 2021
Development

Successfully merging this pull request may close these issues.

Support distributed table-scoped connector-provided procedures
4 participants