
[#4089] fix(hive catalog): the problem of slow acquisition of hive table list #4469

Merged
8 commits merged into apache:main
Nov 4, 2024

Conversation

mygrsun
Contributor

@mygrsun mygrsun commented Aug 9, 2024

What changes were proposed in this pull request?

Fixes the slow retrieval of the Hive table list by replacing the getTableObjectsByName method with listTableNamesByFilter.
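
To illustrate the change described above, here is a minimal, self-contained sketch. It does not use the real Hive client: `FakeMetastore`, `getTableObject`, and the predicate-based `listTableNamesByFilter` are hypothetical in-memory stand-ins (the real method takes a filter string). It only shows why asking the metastore for matching names avoids fetching one full table object per table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** In-memory stand-in for a metastore. This is NOT the Hive client API. */
class FakeMetastore {
  // table name -> table parameters (for example table_type=ICEBERG)
  private final Map<String, Map<String, String>> tables = new LinkedHashMap<>();
  // Counts how many full table objects were shipped to the caller.
  int objectsFetched = 0;

  void addTable(String name, Map<String, String> params) {
    tables.put(name, params);
  }

  List<String> getAllTables() {
    return new ArrayList<>(tables.keySet());
  }

  // Mimics the expensive path: one full table object per requested name.
  Map<String, String> getTableObject(String name) {
    objectsFetched++;
    return tables.get(name);
  }

  // Mimics the cheap path: the filter runs inside the store, only names come back.
  List<String> listTableNamesByFilter(Predicate<Map<String, String>> filter) {
    List<String> names = new ArrayList<>();
    for (Map.Entry<String, Map<String, String>> e : tables.entrySet()) {
      if (filter.test(e.getValue())) {
        names.add(e.getKey());
      }
    }
    return names;
  }
}

class ListTablesSketch {
  static boolean isIceberg(Map<String, String> params) {
    return "ICEBERG".equalsIgnoreCase(params.get("table_type"));
  }

  // Old approach: fetch every table object, then filter on the client side.
  static List<String> listHiveTablesOld(FakeMetastore ms) {
    List<String> result = new ArrayList<>();
    for (String name : ms.getAllTables()) {
      if (!isIceberg(ms.getTableObject(name))) {
        result.add(name);
      }
    }
    return result;
  }

  // New approach: list all names, then subtract the names matched by the filter.
  static List<String> listHiveTablesNew(FakeMetastore ms) {
    List<String> result = ms.getAllTables();
    result.removeAll(ms.listTableNamesByFilter(ListTablesSketch::isIceberg));
    return result;
  }

  static FakeMetastore sample() {
    FakeMetastore ms = new FakeMetastore();
    ms.addTable("orders", Map.of());
    ms.addTable("events_iceberg", Map.of("table_type", "ICEBERG"));
    ms.addTable("users", Map.of("EXTERNAL", "TRUE"));
    return ms;
  }
}
```

Both methods return the same names for the sample data, but only the old one increments `objectsFetched`, which is the per-table cost the PR removes.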

Why are the changes needed?

I found that list-table takes 300s when a schema has 5000 tables.

Fix: #4089

Does this PR introduce any user-facing change?

no

How was this patch tested?

Manual testing

if (!listAllTables) {
  String filter =
      String.format(
          "%stable_type = \"ICEBERG\"", hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS);
Contributor

@mchades mchades Aug 9, 2024

Filtering for only Iceberg tables seems insufficient, and we currently do not support the view type either.
So I think you should show only managed-type tables when listAllTables=false.

Contributor Author

I found that listTableNamesByFilter can't filter by table type, so this method can't meet the requirement of filtering out views.

Contributor

If we do not filter the view, what impact will it have on Gravitino?

Contributor Author

If we do not filter the view, what impact will it have on Gravitino?

I have tested it. Views don't have any impact, so I think we can do it this way.

Contributor Author

@mygrsun mygrsun Aug 19, 2024

I added some comments; please review again.

@mchades
Contributor

mchades commented Aug 12, 2024

Hi @mygrsun, could you please resolve the comments and the CI issue? Thanks.

@mchades
Contributor

mchades commented Aug 19, 2024

@FANNG1 @yuqi1129 Can you help review it?

@mygrsun
Contributor Author

mygrsun commented Aug 21, 2024

The CI issue is troublesome: https://github.com/apache/gravitino/actions/runs/10452698362/job/28942732918?pr=4469
This CI failure is caused by a bug in Hive; the Iceberg project has hit the same problem:
apache/iceberg#2722 (comment)

This Hive bug only occurs with Derby. In our environment the metastore storage is MySQL, so we don't encounter this problem.

@mchades
Contributor

mchades commented Sep 26, 2024

@mygrsun Is there any progress? May I take on this?

@mygrsun
Contributor Author

mygrsun commented Sep 26, 2024

@mygrsun Is there any progress? May I take on this?

Thanks, you can do it.

@mchades mchades force-pushed the issue_4089_hive_slow branch from a8e877d to ebc4b17 Compare November 1, 2024 06:55
@mchades
Contributor

mchades commented Nov 1, 2024

I have verified locally that this PR reduces the time to list 1000 tables from 2043ms to 14ms.

It's ready for review now. @jerryshao

@mchades mchades requested a review from jerryshao November 1, 2024 11:55
})
.map(tb -> NameIdentifier.of(namespace, tb.getTableName()))
.toArray(NameIdentifier[]::new));
if (!listAllTables) {
Contributor

@jerryshao jerryshao Nov 4, 2024

@mchades I think you should also check the Paimon catalog. AFAIK, we recently added support for Paimon with an HMS backend. Please sync with @FANNG1 and @caican00.

Contributor

Based on the Paimon source code and local testing, I updated the PR to use the table parameter table_type=paimon to filter out Paimon tables.
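
As a rough sketch of how such a combined filter string could be assembled: the prefix value below is assumed to mirror `hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS`, the helper names are hypothetical, and joining the two conditions with `or` is an assumption about the metastore filter grammar rather than a copy of the merged code. The `=` operator follows the earlier snippet in this thread, and the values follow what the comments above report (`ICEBERG`, `paimon`).

```java
class TableTypeFilterSketch {
  // Assumed to mirror hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS.
  static final String HIVE_FILTER_FIELD_PARAMS = "hive_filter_field_params__";

  // Builds a condition on one table parameter, e.g. <prefix>table_type = "ICEBERG".
  static String paramEquals(String key, String value) {
    return String.format("%s%s = \"%s\"", HIVE_FILTER_FIELD_PARAMS, key, value);
  }

  // Hypothetical helper: matches tables whose table_type parameter marks them
  // as Iceberg or Paimon, so the caller can subtract them from the full list.
  static String icebergAndPaimonFilter() {
    return paramEquals("table_type", "ICEBERG")
        + " or "
        + paramEquals("table_type", "paimon");
  }

  public static void main(String[] args) {
    System.out.println(icebergAndPaimonFilter());
  }
}
```

The exact values and their case should be checked against what each engine actually writes into the table parameters.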

@jerryshao jerryshao added the branch-0.7 Automatically cherry-pick commit to branch-0.7 label Nov 4, 2024
@mchades mchades requested a review from jerryshao November 4, 2024 11:07
clientPool.run(
    c ->
        c.listTableNamesByFilter(
            schemaIdent.name(), icebergAndPaimonFilter, (short) -1));
Contributor

What's the meaning of (short) -1 here? Can you define a constant and add a comment on it?

Contributor

fixed
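
One way the requested constant could look is sketched below. The constant name and the `applyLimit` helper are hypothetical and are not taken from the merged code; the sketch assumes that a negative `maxTables` means "return all matching tables", which is how the `(short) -1` argument appears to be used here.

```java
import java.util.Arrays;
import java.util.List;

class ListLimitSketch {
  /** Passed as maxTables; a negative value is assumed to mean "no limit". */
  static final short MAX_TABLES_NO_LIMIT = (short) -1;

  // Stand-in for the limit semantics assumed above; not a Hive method.
  static <T> List<T> applyLimit(List<T> names, short maxTables) {
    if (maxTables < 0) {
      return names; // negative limit: return everything
    }
    return names.subList(0, Math.min(maxTables, names.size()));
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("a", "b", "c");
    System.out.println(applyLimit(names, MAX_TABLES_NO_LIMIT).size());
    System.out.println(applyLimit(names, (short) 2).size());
  }
}
```

A named constant documents the intent at the call site, which is what the review comment asks for.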

Contributor

@jerryshao jerryshao left a comment

LGTM.

@jerryshao jerryshao merged commit 6d05ec4 into apache:main Nov 4, 2024
26 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 4, 2024
…ble list (#4469)

Co-authored-by: ericqin <[email protected]>
Co-authored-by: mchades <[email protected]>
mplmoknijb pushed a commit to mplmoknijb/gravitino that referenced this pull request Nov 6, 2024
…ive table list (apache#4469)

Labels
branch-0.7 Automatically cherry-pick commit to branch-0.7
Development

Successfully merging this pull request may close these issues.

[Bug report] list-table api is very slow when table quantity is very large
5 participants