
Glue analyze performance tweaks #8839

Merged: 2 commits into trinodb:master, Aug 16, 2021

Conversation

@alexjo2144 (Member) commented Aug 9, 2021

As far as testing goes, I have a small table in S3 using Glue for the metastore, ~500 rows over 75 partitions. Analyzing that table using a single-node setup on my computer took about 75 seconds before the change and about 15 seconds after.

@hashhar (Member) left a comment

Looks good modulo the first commit. I think the main performance improvement you're seeing is from the second commit (batch partition updates).
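
For context, here is a minimal sketch of the batching pattern referred to above. The names are illustrative, not the PR's actual code; it only assumes that Glue's batch partition calls accept a limited number of entries per request.

// Illustrative only: group per-partition updates into fixed-size chunks and
// submit each chunk with a single batch call instead of one call per partition.
import com.google.common.collect.Lists;

import java.util.List;
import java.util.function.Consumer;

class BatchedPartitionUpdates
{
    // assumed cap on entries per batch request
    private static final int BATCH_SIZE = 100;

    static <T> void updateInBatches(List<T> partitionUpdates, Consumer<List<T>> submitBatch)
    {
        for (List<T> chunk : Lists.partition(partitionUpdates, BATCH_SIZE)) {
            // one metastore round trip per chunk, rather than per partition
            submitBatch.accept(chunk);
        }
    }
}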

I also didn't understand the last commit - will leave it to @losipiuk (maybe the commit message could explain the "why")

Having control over executors is useful for Glue because it's a heavily rate-limited service and it's very easy to run into failures due to rate-limiting depending on how large your tables are.
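
To make the rate-limiting point concrete, a rough sketch of bounding concurrency with a fixed-size executor follows; the thread count and task list are placeholders, not the PR's actual configuration or code.

// Illustrative only: cap the number of Glue statistics calls in flight so a
// large table does not fan out into enough parallel requests to hit rate limits.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BoundedGlueCalls
{
    static void runBounded(List<Runnable> perPartitionCalls, int maxThreads)
    {
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        try {
            CompletableFuture<?>[] futures = perPartitionCalls.stream()
                    .map(call -> CompletableFuture.runAsync(call, executor))
                    .toArray(CompletableFuture[]::new);
            // join() propagates the first failure to the caller
            CompletableFuture.allOf(futures).join();
        }
        finally {
            executor.shutdown();
        }
    }
}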

@alexjo2144 (Member Author)

> I also didn't understand the last commit - will leave it to @losipiuk (maybe the commit message could explain the "why")

The intent was to support parallelism at the partition level, but after looking at it again today I don't think that approach was correct. I tried the changes with Thrift and there was a considerable slowdown, so I'm reworking this a little more to be specific to Glue.

@alexjo2144 force-pushed the parallelize-analyze branch 3 times, most recently from f116b3a to 6941b9c on August 10, 2021 20:36
@alexjo2144 (Member Author) commented Aug 10, 2021

@losipiuk @hashhar please take another look. I extended the changes a bit to limit the impact on other metastore implementations, and changed one more method in GlueHiveMetastore to use async/batch methods as appropriate. If I allow for up to 20 writer threads, the analyze that was taking 75 seconds before finishes in about 5 seconds with these changes.
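
For reference, the thread counts end up exposed as Hive catalog configuration properties (the property names appear in a later review comment in this thread; the values below just mirror the numbers discussed here and are not necessarily the final defaults):

hive.metastore.glue.read-statistics-threads=5
hive.metastore.glue.write-statistics-threads=20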

Also, CLI is not starting for some reason.

@losipiuk (Member) left a comment

Good job. Some comments, but overall it looks good. Less trivial than I expected.

@losipiuk (Member)

Can we also bump the default number of threads we use for reading/writing stats? I recall @findepi suggested setting it to ~5 by default.

@hashhar (Member) left a comment

Some comments. More complicated than expected.

Will do a follow-up review to deep dive re: exception propagation.

{
if (!columnStatistics.isEmpty()) {
if (statisticsUpdates.stream().anyMatch(update -> !update.getColumnStatistics().isEmpty())) {

Member

super duper nit: converting to noneMatch would allow using a static method reference and avoid the double negation here.

@alexjo2144 force-pushed the parallelize-analyze branch 3 times, most recently from 77d05b5 to 14146a9 on August 12, 2021 18:25
@alexjo2144 (Member Author)

Comments addressed, and hopefully code improved a bit. The big thing was adding support for batch/async get statistics calls, which simplified statistics updating a bit. @hashhar @losipiuk
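
As a rough illustration of the batch/async read shape described here (the helper below is hypothetical and generic, not GlueHiveMetastore's actual code), the per-partition statistics fetches can be issued concurrently and then collected into a single map, similar to the Map<Partition, Map<String, HiveColumnStatistics>> used in the snippet that follows.

// Illustrative only: fetch one value per partition concurrently, then join the
// futures into a single result map.
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

import static java.util.stream.Collectors.toMap;

class AsyncStatisticsFetch
{
    static <P, S> Map<P, S> fetchAll(List<P> partitions, Function<P, S> fetchOne, Executor executor)
    {
        Map<P, CompletableFuture<S>> futures = partitions.stream()
                .collect(toMap(Function.identity(), partition ->
                        CompletableFuture.supplyAsync(() -> fetchOne.apply(partition), executor)));
        // join() surfaces any individual fetch failure to the caller
        return futures.entrySet().stream()
                .collect(toMap(Map.Entry::getKey, entry -> entry.getValue().join()));
    }
}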

Set<Partition> partitions = batchGetPartitionResult.getPartitions().stream().map(partitionConverter).collect(toImmutableSet());
Map<Partition, Map<String, HiveColumnStatistics>> statisticsPerPartition = columnStatisticsProvider.getPartitionColumnStatistics(partitions);

statisticsPerPartition.forEach((partition, columnStatistics) -> {

@losipiuk (Member) commented Aug 13, 2021

Just to double-check: if there are no statistics present in Glue for a partition, will we still get an entry in the map returned by columnStatisticsProvider.getPartitionColumnStatistics?

@alexjo2144 (Member Author)

Yes

@losipiuk (Member) left a comment

Thanks, LGTM. Some nits.

@hashhar (Member) left a comment

A question about exceptions. Looks good modulo Lukasz's comments.

@losipiuk Reminder to ensure Glue tests get run.

@losipiuk (Member)

> @losipiuk Reminder to ensure Glue tests get run.

Good point. I will send out a draft PR from origin to trigger the tests.

@losipiuk losipiuk mentioned this pull request Aug 13, 2021
@alexjo2144 (Member Author)

Nits/comments applied. Thanks

@hashhar (Member) left a comment

LGTM. Can you also update hive.rst to reflect the change in default values of hive.metastore.glue.write-statistics-threads and hive.metastore.glue.read-statistics-threads?

@losipiuk losipiuk merged commit c10cd01 into trinodb:master Aug 16, 2021
@losipiuk losipiuk mentioned this pull request Aug 16, 2021
@ebyhr ebyhr added this to the 361 milestone Aug 19, 2021