Glue analyze performance tweaks #8839
Looks good % first commit. I think the main performance improvement you're seeing is from the second commit (batch partition updates).
I also didn't understand the last commit - will leave it to @losipiuk (maybe the commit message could explain the "why")
Having control over executors is useful for Glue because it's a heavily rate-limited service and it's very easy to run into failures due to rate-limiting, depending on how large your tables are.
The intent was to support parallelism at the partition level, but after looking at it again today I don't think that approach was correct. I tried the changes with Thrift and there was a considerable slowdown, so I'm reworking this a little more to be specific to Glue.
f116b3a to 6941b9c
@losipiuk @hashhar please take another look. I extended the changes a bit to limit changes for other metastore implementations, and changed one more method in GlueHiveMetastore to use async/batch methods as appropriate. If I allow up to 20 writer threads, the analyze that was taking 75 seconds before finishes in 5 seconds with these changes. Also, the CLI is not starting for some reason.
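For readers following along, here is a rough sketch of the general pattern described above: split the work into batches and send them through a bounded thread pool. It is not the code from this PR. The class name `BatchedUpdates`, the `sendBatch` callback, and the batch size and thread count passed in are all hypothetical and stand in for the real Glue calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; not part of Trino.
final class BatchedUpdates
{
    private BatchedUpdates() {}

    // Split a list into consecutive chunks of at most batchSize elements.
    static <T> List<List<T>> chunk(List<T> items, int batchSize)
    {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    // Run sendBatch for every chunk on a pool of at most `threads` workers,
    // then wait for all chunks to finish.
    static <T> void updateAll(List<T> items, int batchSize, int threads, Consumer<List<T>> sendBatch)
    {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (List<T> batch : chunk(items, batchSize)) {
                futures.add(CompletableFuture.runAsync(() -> sendBatch.accept(batch), executor));
            }
            // A failure in any batch surfaces here as a CompletionException.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).join();
        }
        finally {
            executor.shutdown();
        }
    }
}
```

The fixed pool size is what keeps the number of concurrent requests bounded, which matters for a rate-limited service; the batch size reduces the total number of requests.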
Good job. Some comments, but overall looks good. Less trivial than I expected.
Can we also bump the default number of threads we use for reading/writing stats? I recall @findepi suggested setting it to ~5 by default.
Some comments. More complicated than expected.
Will do a follow-up review to deep dive re: exception propagation.
...e/src/main/java/io/trino/plugin/hive/metastore/glue/DefaultGlueColumnStatisticsProvider.java
{
if (!columnStatistics.isEmpty()) {
if (statisticsUpdates.stream().anyMatch(update -> !update.getColumnStatistics().isEmpty())) {
super duper nit: Converting to noneMatch would allow using a method reference and avoid the double negation here.
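For reference, a small sketch of how the stream forms relate. It uses plain `Map<String, Long>` values as a stand-in for the update objects in the PR (an assumption for illustration; the real code calls `getColumnStatistics()` on each update, and the class name below is hypothetical). Swapping `anyMatch(x -> !p(x))` directly for `noneMatch(p)` would change when the branch fires; the method-reference form that keeps the same result is `!allMatch(p)`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types for illustration only.
final class StatisticsChecks
{
    private StatisticsChecks() {}

    // Form used in the diff: true when at least one update carries statistics.
    static boolean anyNonEmpty(List<Map<String, Long>> updates)
    {
        return updates.stream().anyMatch(update -> !update.isEmpty());
    }

    // Same result as anyNonEmpty, written with a method reference.
    static boolean notAllEmpty(List<Map<String, Long>> updates)
    {
        return !updates.stream().allMatch(Map::isEmpty);
    }

    // Different predicate: true only when every update carries statistics
    // (and also true for an empty list).
    static boolean noneEmpty(List<Map<String, Long>> updates)
    {
        return updates.stream().noneMatch(Map::isEmpty);
    }
}
```

Whether `noneMatch` fits therefore depends on which condition the surrounding `if` is meant to express.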
77d05b5 to 14146a9
...e/src/main/java/io/trino/plugin/hive/metastore/glue/DefaultGlueColumnStatisticsProvider.java
Set<Partition> partitions = batchGetPartitionResult.getPartitions().stream().map(partitionConverter).collect(toImmutableSet());
Map<Partition, Map<String, HiveColumnStatistics>> statisticsPerPartition = columnStatisticsProvider.getPartitionColumnStatistics(partitions);

statisticsPerPartition.forEach((partition, columnStatistics) -> {
Just to double check: if there are no statistics present in Glue for a partition, will we still get an entry in the map returned by columnStatisticsProvider.getPartitionColumnStatistics?
Yes
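To make the contract in this exchange concrete, here is a minimal sketch of a result map that has an entry for every requested partition, with an empty inner map where nothing was found. This is an illustration of the behavior confirmed above, not the provider's actual implementation; the class and method names are hypothetical, and `Long` stands in for `HiveColumnStatistics`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only.
final class PartitionStatistics
{
    private PartitionStatistics() {}

    // Return one entry per requested partition; partitions with no fetched
    // statistics map to an empty map instead of being left out.
    static <P> Map<P, Map<String, Long>> forAllPartitions(Set<P> partitions, Map<P, Map<String, Long>> fetched)
    {
        Map<P, Map<String, Long>> result = new LinkedHashMap<>();
        for (P partition : partitions) {
            result.put(partition, fetched.getOrDefault(partition, Map.of()));
        }
        return result;
    }
}
```

With this shape, the `forEach` over the result visits every partition, so callers do not need a separate pass for partitions without statistics.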
Thanks. LGTM. some nits
A question about exceptions. Looks good % Lukasz's comments.
@losipiuk Reminder to ensure Glue tests get run.
Good point. I will send out a draft PR from origin to trigger tests.
14146a9 to e42511a
Nits/comments applied. Thanks
LGTM. Can you also update hive.rst to reflect the change in default values of hive.metastore.glue.write-statistics-threads and hive.metastore.glue.read-statistics-threads?
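For context, a catalog properties fragment showing where the two settings named above would go. The values are illustrative only, borrowed from numbers mentioned in this thread (~5 suggested as a default, 20 writer threads used in a timing run); they are not a statement of the actual defaults.

```properties
# Illustrative values, not the documented defaults
hive.metastore.glue.read-statistics-threads=5
hive.metastore.glue.write-statistics-threads=20
```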
As far as testing goes, I have a small table in S3 using Glue for the metastore, ~500 rows over 75 partitions. Analyzing that table using a single-node setup on my computer took about 75 seconds before the change and takes about 15 seconds after.