Speed up GetTables operation for Spark session catalog #6018

pan3793 · 2024-01-25T14:47:11Z

🔍 Description

Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in #4956 and #5949, the GetTables operation is quite slow in some cases. #4444 introduced kyuubi.operation.getTables.ignoreTableProperties to speed up GetTables for V2 catalogs, but it does not cover the session catalog.

Describe Your Solution 🔧

Extend the scope of kyuubi.operation.getTables.ignoreTableProperties to cover the GetTables operation for the Spark session catalog.

Currently, the basic steps of GetTables in the Spark engine are:

val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)

The resulting tableObjects are then filtered by tableTypes: Set[String].
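The flow above can be sketched with a small in-memory model. This is only an illustration, not Kyuubi or Spark code: `TableMeta`, `FakeCatalog`, `SlowGetTables`, and the `metadataLookups` counter are hypothetical names, and the per-name lookup merely stands in for `getTablesByName`.

```scala
// Hypothetical in-memory stand-in for the session catalog; not Kyuubi or Spark code.
final case class TableMeta(db: String, name: String, tableType: String, comment: String)

final class FakeCatalog(tables: Seq[TableMeta]) {
  // Counts full metadata lookups, which are the expensive part in a real metastore.
  var metadataLookups: Int = 0

  def listDatabases(): Seq[String] = tables.map(_.db).distinct

  // Cheap: returns table names only.
  def listTables(db: String): Seq[String] = tables.filter(_.db == db).map(_.name)

  // Expensive: one metadata lookup per name, analogous to getTablesByName.
  def getTablesByName(db: String, names: Seq[String]): Seq[TableMeta] =
    names.flatMap { n =>
      metadataLookups += 1
      tables.find(t => t.db == db && t.name == n)
    }
}

object SlowGetTables {
  // List names, load full metadata for every table, then filter by table type.
  def getTablesSlow(catalog: FakeCatalog, tableTypes: Set[String]): Seq[TableMeta] =
    catalog.listDatabases().flatMap { db =>
      val names = catalog.listTables(db)
      catalog.getTablesByName(db, names).filter(t => tableTypes.contains(t.tableType))
    }
}
```

In this model every table costs one metadata lookup even when the type filter later discards it, which is why the cost grows with the total number of tables rather than with the size of the result.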

The cost of catalog.getTablesByName(identifiers) is quite high when the number of tables is large, e.g. tens of thousands.

In some cases, tables are listed only to display their names. There it is worth speeding up the operation at the cost of ignoring some properties (e.g. table comments) and query criteria. Specifically, when kyuubi.operation.getTables.ignoreTableProperties=true, the tableTypes criterion is ignored and all tables and views are returned as TABLE.
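A minimal sketch of that trade-off, assuming a simplified model rather than the actual change in SparkCatalogUtils: `TableRow`, `FastGetTables`, and the `listTables` / `loadMeta` function parameters are hypothetical names introduced for illustration.

```scala
// Hypothetical model of the fast path; names are illustrative, not Kyuubi's API.
final case class TableRow(db: String, name: String, tableType: String, comment: String)

object FastGetTables {
  // listTables: cheap name listing; loadMeta: expensive per-table metadata lookup.
  def getTables(
      databases: Seq[String],
      listTables: String => Seq[String],
      loadMeta: (String, String) => TableRow,
      tableTypes: Set[String],
      ignoreTableProperties: Boolean): Seq[TableRow] =
    databases.flatMap { db =>
      val names = listTables(db)
      if (ignoreTableProperties) {
        // Fast path: no metadata lookup, so the real type and comment are unknown.
        // Every entry is reported as TABLE and tableTypes is not applied.
        names.map(n => TableRow(db, n, "TABLE", ""))
      } else {
        // Default path: load metadata for each table, then filter by type.
        names.map(n => loadMeta(db, n)).filter(r => tableTypes.contains(r.tableType))
      }
    }
}
```

With the flag enabled, a caller asking only for views still receives every table and view, all labelled TABLE, so the setting suits clients that only need names to display.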

Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Pass GA

Checklist 📝

- [x] This patch was not authored or co-authored using Generative Tooling

Be nice. Be informative.

codecov-commenter · 2024-01-25T16:21:02Z

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (47a1091) 61.19% compared to head (058001c) 61.11%.
Report is 2 commits behind head on master.

Files	Patch %	Lines
...e/kyuubi/engine/spark/util/SparkCatalogUtils.scala	69.56%	5 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #6018      +/-   ##
============================================
- Coverage     61.19%   61.11%   -0.09%     
  Complexity       23       23              
============================================
  Files           623      623              
  Lines         37060    37103      +43     
  Branches       5024     5029       +5     
============================================
- Hits          22680    22674       -6     
- Misses        11948    11988      +40     
- Partials       2432     2441       +9


Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d474768)
Signed-off-by: Cheng Pan <[email protected]>

pan3793 · 2024-01-29T06:21:58Z

Merged to master/1.8


…se notes

# 🔍 Description
## Issue References 🔗

Currently, we use a rather primitive way to manually write release notes from scratch, and some of the mechanical and repetitive work can be simplified by the scripts.

## Describe Your Solution 🔧

Adds a script to simplify the process of creating release notes.

Note: it just simplifies some processes, the release manager still needs to tune the outputs by hand.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

```
RELEASE_TAG=v1.8.1 PREVIOUS_RELEASE_TAG=v1.8.0 build/release/pre_gen_release_notes.py
```

```
$ head build/release/commits-v1.8.1.txt
[KYUUBI #5981] Deploy Spark Hive connector with Scala 2.13 to Maven Central
[KYUUBI #6058] Make Jetty server stop timeout configurable
[KYUUBI #5952][1.8] Disconnect connections without running operations after engine maxlife time graceful period
[KYUUBI #6048] Assign serviceNode and add volatile for variables
[KYUUBI #5991] Error on reading Atlas properties composed of multi values
[KYUUBI #6045] [REST] Sync the AdminRestApi with the AdminResource Apis
[KYUUBI #6047] [CI] Free up disk space
[KYUUBI #6036] JDBC driver conditional sets fetchSize on opening session
[KYUUBI #6028] Exited spark-submit process should not block batch submit queue
[KYUUBI #6018] Speed up GetTables operation for Spark session catalog
```

```
$ head build/release/contributors-v1.8.1.txt
* Shaoyun Chen -- [KYUUBI #5857][KYUUBI #5720][KYUUBI #5785][KYUUBI #5617]
* Chao Chen -- [KYUUBI #5750]
* Flyangz -- [KYUUBI #5832]
* Pengqi Li -- [KYUUBI #5713]
* Bowen Liang -- [KYUUBI #5730][KYUUBI #5802][KYUUBI #5767][KYUUBI #5831][KYUUBI #5801][KYUUBI #5754][KYUUBI #5626][KYUUBI #5811][KYUUBI #5853][KYUUBI #5765]
* Paul Lin -- [KYUUBI #5799][KYUUBI #5814]
* Senmiao Liu -- [KYUUBI #5969][KYUUBI #5244]
* Xiao Liu -- [KYUUBI #5962]
* Peiyue Liu -- [KYUUBI #5331]
* Junjie Ma -- [KYUUBI #5789]
```

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6074 from pan3793/release-script.

Closes #6074

3d5ec20 [Cheng Pan] credits
1765279 [Cheng Pan] Add a script to simplify the process of creating release notes

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>


Speed up GetTables operation

615b747

github-actions bot added kind:documentation Documentation is a feature! module:spark module:common labels Jan 25, 2024

fix

405b124

fix

058001c

pan3793 requested review from cxzl25 and cfmcgrady January 29, 2024 05:49

cxzl25 approved these changes Jan 29, 2024

pan3793 self-assigned this Jan 29, 2024

pan3793 added this to the v1.8.1 milestone Jan 29, 2024

pan3793 closed this in d474768 Jan 29, 2024

pan3793 mentioned this pull request Nov 25, 2024

[Bug] DataGrip executes refresh command slowly #6797

Open

4 tasks
