
Glue analyze performance tweaks #8839

Merged: 2 commits into trinodb:master, Aug 16, 2021

Conversation

@alexjo2144 (Member) commented Aug 9, 2021

As far as testing goes, I have a small table in S3 using Glue for the metastore, ~500 rows over 75 partitions. Analyzing that table using a single-node setup on my computer took about 75 seconds before the change and about 15 seconds after.

@hashhar (Member) left a comment

Looks good modulo the first commit. I think the main performance improvement you're seeing is from the second commit (batch partition updates).
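
For context, here is a minimal sketch of the batching pattern referred to above. The names are illustrative, not the PR's actual code; it only assumes that Glue's batch partition calls accept a limited number of entries per request.

// Illustrative only: group per-partition updates into fixed-size chunks and
// submit each chunk with a single batch call instead of one call per partition.
import com.google.common.collect.Lists;

import java.util.List;
import java.util.function.Consumer;

class BatchedPartitionUpdates
{
    // assumed cap on entries per batch request
    private static final int BATCH_SIZE = 100;

    static <T> void updateInBatches(List<T> partitionUpdates, Consumer<List<T>> submitBatch)
    {
        for (List<T> chunk : Lists.partition(partitionUpdates, BATCH_SIZE)) {
            // one metastore round trip per chunk, rather than per partition
            submitBatch.accept(chunk);
        }
    }
}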

I also didn't understand the last commit - will leave it to @losipiuk (maybe the commit message could explain the "why")

Having control over executors is useful for Glue because it's a heavily rate-limited service and it's very easy to run into failures due to rate-limiting depending on how large your tables are.
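
To make the rate-limiting point concrete, a rough sketch of bounding concurrency with a fixed-size executor follows; the thread count and task list are placeholders, not the PR's actual configuration or code.

// Illustrative only: cap the number of Glue statistics calls in flight so a
// large table does not fan out into enough parallel requests to hit rate limits.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BoundedGlueCalls
{
    static void runBounded(List<Runnable> perPartitionCalls, int maxThreads)
    {
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        try {
            CompletableFuture<?>[] futures = perPartitionCalls.stream()
                    .map(call -> CompletableFuture.runAsync(call, executor))
                    .toArray(CompletableFuture[]::new);
            // join() propagates the first failure to the caller
            CompletableFuture.allOf(futures).join();
        }
        finally {
            executor.shutdown();
        }
    }
}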

@alexjo2144 (Member Author)

> I also didn't understand the last commit - will leave it to @losipiuk (maybe the commit message could explain the "why")

The intent was to support parallelism at the partition level, but after looking at it again today I don't think that approach was correct. I tried the changes with Thrift and there was a considerable slowdown, so I'm reworking this a little more to be specific to Glue.

@alexjo2144 force-pushed the parallelize-analyze branch 3 times, most recently from f116b3a to 6941b9c on August 10, 2021 20:36
@alexjo2144 (Member Author) commented Aug 10, 2021

@losipiuk @hashhar please take another look. I extended the changes a bit to limit the impact on other metastore implementations, and changed one more method in GlueHiveMetastore to use async/batch methods as appropriate. If I allow for up to 20 writer threads, the analyze that was taking 75 seconds before finishes in about 5 seconds with these changes.
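
For reference, the thread counts end up exposed as Hive catalog configuration properties (the property names appear in a later review comment in this thread; the values below just mirror the numbers discussed here and are not necessarily the final defaults):

hive.metastore.glue.read-statistics-threads=5
hive.metastore.glue.write-statistics-threads=20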

Also, CLI is not starting for some reason.

@losipiuk (Member) left a comment

Good job. Some comments, but overall it looks good. Less trivial than I expected.

@losipiuk (Member)

Can we also bump the default number of threads we use for reading/writing stats? I recall @findepi suggested setting it to ~5 by default.

@hashhar (Member) left a comment

Some comments. More complicated than expected.

Will do a follow-up review to deep dive re: exception propagation.

{
if (!columnStatistics.isEmpty()) {
if (statisticsUpdates.stream().anyMatch(update -> !update.getColumnStatistics().isEmpty())) {

Member

super duper nit: converting to noneMatch would allow using a static method reference and avoid the double negation here.

@alexjo2144 force-pushed the parallelize-analyze branch 3 times, most recently from 77d05b5 to 14146a9 on August 12, 2021 18:25
@alexjo2144 (Member Author)

Comments addressed, and hopefully code improved a bit. The big thing was adding support for batch/async get statistics calls, which simplified statistics updating a bit. @hashhar @losipiuk
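
As a rough illustration of the batch/async read shape described here (the helper below is hypothetical and generic, not GlueHiveMetastore's actual code), the per-partition statistics fetches can be issued concurrently and then collected into a single map, similar to the Map<Partition, Map<String, HiveColumnStatistics>> used in the snippet that follows.

// Illustrative only: fetch one value per partition concurrently, then join the
// futures into a single result map.
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

import static java.util.stream.Collectors.toMap;

class AsyncStatisticsFetch
{
    static <P, S> Map<P, S> fetchAll(List<P> partitions, Function<P, S> fetchOne, Executor executor)
    {
        Map<P, CompletableFuture<S>> futures = partitions.stream()
                .collect(toMap(Function.identity(), partition ->
                        CompletableFuture.supplyAsync(() -> fetchOne.apply(partition), executor)));
        // join() surfaces any individual fetch failure to the caller
        return futures.entrySet().stream()
                .collect(toMap(Map.Entry::getKey, entry -> entry.getValue().join()));
    }
}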

Set<Partition> partitions = batchGetPartitionResult.getPartitions().stream().map(partitionConverter).collect(toImmutableSet());
Map<Partition, Map<String, HiveColumnStatistics>> statisticsPerPartition = columnStatisticsProvider.getPartitionColumnStatistics(partitions);

statisticsPerPartition.forEach((partition, columnStatistics) -> {

@losipiuk (Member) commented Aug 13, 2021

Just to double-check: if there are no statistics present in Glue for a partition, will we still get an entry in the map returned by columnStatisticsProvider.getPartitionColumnStatistics?

@alexjo2144 (Member Author)

Yes

@losipiuk (Member) left a comment

Thanks, LGTM. Some nits.

@hashhar (Member) left a comment

A question about exceptions. Looks good modulo Lukasz's comments.

@losipiuk Reminder to ensure Glue tests get run.

@losipiuk (Member)

> @losipiuk Reminder to ensure Glue tests get run.

Good point. I will send out a draft PR from origin to trigger the tests.

@losipiuk losipiuk mentioned this pull request Aug 13, 2021
@alexjo2144 (Member Author)

Nits/comments applied. Thanks

@hashhar (Member) left a comment

LGTM. Can you also update hive.rst to reflect the change in default values of hive.metastore.glue.write-statistics-threads and hive.metastore.glue.read-statistics-threads?

@losipiuk losipiuk merged commit c10cd01 into trinodb:master Aug 16, 2021
@losipiuk losipiuk mentioned this pull request Aug 16, 2021
@ebyhr ebyhr added this to the 361 milestone Aug 19, 2021