[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

liaoyt · 2023-03-04T05:53:06Z

Why are the changes needed?

GetTables operation is too slow because it queries table details info one by one, but then only a table comment is used to construct a result row, which i think could be optional.
This PR add an optional config which can control this operation. By default, GetTables operation queries all message. Otherwise, GetTables operation just return table identifiers.

How was this patch tested?

Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before make a pull request

pan3793 · 2023-03-04T08:01:57Z

kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala

@@ -2713,4 +2713,11 @@ object KyuubiConf {
      .version("1.7.0")
      .timeConf
      .createWithDefault(Duration.ofSeconds(60).toMillis)
+
+  val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
+    buildConf("kyuubi.engine.spark.list.tables")


I suggest using kyuubi.operation.getTables.ignoreTableProperties here, because

technically, this optimization can be applied to all engines.

back to the purpose of this PR, it aims to avoid the call of retrieving each table's properties to speed up the GetTables operation, for Spark, only "comment" is taken here, but it's better to make the configuration more generic, right?

Good suggestion. This can be a generic configuration and kyuubi.operation.getTables.ignoreTableProperties looks more reasonable, but the relational implement maybe different in other engine(Flink, Trino, eg), which can open another issue to follow up.

pan3793 · 2023-03-04T08:14:10Z

...i-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/shim/CatalogShim_v3_0.scala

@@ -166,10 +167,12 @@ class CatalogShim_v3_0 extends CatalogShim_v2_4 {
        val identifiers = namespaces.flatMap { ns =>
          tc.listTables(ns).filter(i => tp.matcher(quoteIfNeeded(i.name())).matches())
        }
+        val listTablesOnly = spark.conf.getOption(KyuubiConf.ENGINE_SPARK_LIST_TABLES.key)
+          .map(_.toBoolean).getOrElse(KyuubiConf.ENGINE_SPARK_LIST_TABLES.defaultVal.get)


it does not consider the kyuubi session conf, I suggest evaluating the flag in the GetTables, you can refer to SparkOperation#operationSparkListenerEnabled

Thanks, I'll have a look.

pan3793 · 2023-03-04T08:15:17Z

kyuubi-common/src/main/scala/org/apache/kyuubi/config/KyuubiConf.scala

+
+  val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
+    buildConf("kyuubi.engine.spark.list.tables")
+      .doc("Only query table identifiers when set to true. Work on Spark 3.x only.")


please update the description to match the configuration name, and we don't need to mention Spark 3.x since Kyuubi only supports Spark 3.1 and above

codecov-commenter · 2023-03-04T08:59:20Z

Codecov Report

Merging #4444 (0c9985e) into master (355ed80) will decrease coverage by 0.05%.
The diff coverage is 88.88%.

❗ Current head 0c9985e differs from pull request most recent head af5e60e. Consider uploading reports for the commit af5e60e to get more accurate results

@@             Coverage Diff              @@
##             master    #4444      +/-   ##
============================================
- Coverage     53.28%   53.23%   -0.05%     
  Complexity       13       13              
============================================
  Files           569      569              
  Lines         31146    31171      +25     
  Branches       4208     4210       +2     
============================================
- Hits          16595    16594       -1     
- Misses        12980    13000      +20     
- Partials       1571     1577       +6

Impacted Files	Coverage Δ
...he/kyuubi/engine/spark/shim/CatalogShim_v3_0.scala	`79.34% <75.00%> (-0.66%)`	⬇️
...in/scala/org/apache/kyuubi/config/KyuubiConf.scala	`97.39% <100.00%> (+<0.01%)`	⬆️
...kyuubi/server/trino/api/v1/StatementResource.scala	`50.66% <0.00%> (-8.00%)`	⬇️
...uubi/engine/spark/events/SparkOperationEvent.scala	`88.88% <0.00%> (-5.56%)`	⬇️
...rg/apache/kyuubi/server/api/v1/AdminResource.scala	`84.55% <0.00%> (-2.84%)`	⬇️
...in/spark/authz/ranger/SparkRangerAdminPlugin.scala	`64.47% <0.00%> (-2.64%)`	⬇️
...g/apache/kyuubi/operation/BatchJobSubmission.scala	`75.27% <0.00%> (-2.20%)`	⬇️
...n/scala/org/apache/kyuubi/KyuubiSQLException.scala	`91.56% <0.00%> (-1.21%)`	⬇️
.../apache/kyuubi/client/KyuubiSyncThriftClient.scala	`89.49% <0.00%> (-0.85%)`	⬇️
...n/scala/org/apache/kyuubi/engine/ProcBuilder.scala	`78.39% <0.00%> (-0.62%)`	⬇️
... and 2 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pan3793 · 2023-03-06T02:39:24Z

Thanks, merged to master

# 🔍 Description ## Issue References 🔗 This pull request aims to speed up the GetTables operation for the Spark session catalog. As reported in #4956, #5949, the GetTables operation is quite slow in some cases, and in #4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog. ## Describe Your Solution 🔧 Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog. Currently, the basic step of GetTables in the Spark engine is ``` val catalog: String = getCatalog(spark, catalogName) val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern) val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false) val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers) ``` then filter `tableObjects` with `tableTypes: Set[String]`. The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand. For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.) ## Types of changes 🔖 - [ ] Bugfix (non-breaking change which fixes an issue) - [x] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 Pass GA --- # Checklist 📝 - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) **Be nice. Be informative.** Closes #6018 from pan3793/fast-get-table. Closes #6018 058001c [Cheng Pan] fix 405b124 [Cheng Pan] fix 615b747 [Cheng Pan] Speed up GetTables operation Authored-by: Cheng Pan <[email protected]> Signed-off-by: Cheng Pan <[email protected]>

# 🔍 Description ## Issue References 🔗 This pull request aims to speed up the GetTables operation for the Spark session catalog. As reported in #4956, #5949, the GetTables operation is quite slow in some cases, and in #4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog. ## Describe Your Solution 🔧 Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog. Currently, the basic step of GetTables in the Spark engine is ``` val catalog: String = getCatalog(spark, catalogName) val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern) val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false) val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers) ``` then filter `tableObjects` with `tableTypes: Set[String]`. The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand. For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.) ## Types of changes 🔖 - [ ] Bugfix (non-breaking change which fixes an issue) - [x] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 Pass GA --- # Checklist 📝 - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) **Be nice. Be informative.** Closes #6018 from pan3793/fast-get-table. Closes #6018 058001c [Cheng Pan] fix 405b124 [Cheng Pan] fix 615b747 [Cheng Pan] Speed up GetTables operation Authored-by: Cheng Pan <[email protected]> Signed-off-by: Cheng Pan <[email protected]> (cherry picked from commit d474768) Signed-off-by: Cheng Pan <[email protected]>

…atalog # 🔍 Description ## Issue References 🔗 This pull request aims to speed up the GetTables operation for the Spark session catalog. As reported in apache#4956, apache#5949, the GetTables operation is quite slow in some cases, and in apache#4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog. ## Describe Your Solution 🔧 Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog. Currently, the basic step of GetTables in the Spark engine is ``` val catalog: String = getCatalog(spark, catalogName) val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern) val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false) val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers) ``` then filter `tableObjects` with `tableTypes: Set[String]`. The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand. For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.) ## Types of changes 🔖 - [ ] Bugfix (non-breaking change which fixes an issue) - [x] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 Pass GA --- # Checklist 📝 - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) **Be nice. Be informative.** Closes apache#6018 from pan3793/fast-get-table. Closes apache#6018 058001c [Cheng Pan] fix 405b124 [Cheng Pan] fix 615b747 [Cheng Pan] Speed up GetTables operation Authored-by: Cheng Pan <[email protected]> Signed-off-by: Cheng Pan <[email protected]>

Supports ignore table comment when list all tables.

5e8687c

github-actions bot added module:common module:spark labels Mar 4, 2023

add doc

0c9985e

github-actions bot added the kind:documentation Documentation is a feature! label Mar 4, 2023

pan3793 reviewed Mar 4, 2023

View reviewed changes

liaoyt changed the title ~~[KYUUBI #4171] Supports ignore table comment when list all tables.~~ [WIP][KYUUBI #4171] Supports ignore table comment when list all tables. Mar 5, 2023

rename config

af5e60e

liaoyt changed the title ~~[WIP][KYUUBI #4171] Supports ignore table comment when list all tables.~~ [KYUUBI #4171] Supports ignore table comment when list all tables. Mar 6, 2023

liaoyt requested a review from pan3793 March 6, 2023 01:43

pan3793 approved these changes Mar 6, 2023

View reviewed changes

pan3793 changed the title ~~[KYUUBI #4171] Supports ignore table comment when list all tables.~~ [KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation Mar 6, 2023

pan3793 assigned liaoyt Mar 6, 2023

pan3793 added this to the v1.8.0 milestone Mar 6, 2023

pan3793 closed this in b40fea2 Mar 6, 2023

pan3793 mentioned this pull request Jan 25, 2024

Speed up GetTables operation for Spark session catalog #6018

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

liaoyt commented Mar 4, 2023

pan3793 Mar 4, 2023

liaoyt Mar 5, 2023

pan3793 Mar 4, 2023

liaoyt Mar 5, 2023

pan3793 Mar 4, 2023 •

edited

Loading

codecov-commenter commented Mar 4, 2023 •

edited

Loading

pan3793 commented Mar 6, 2023

[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

Conversation

liaoyt commented Mar 4, 2023

Why are the changes needed?

How was this patch tested?

pan3793 Mar 4, 2023

Choose a reason for hiding this comment

liaoyt Mar 5, 2023

Choose a reason for hiding this comment

pan3793 Mar 4, 2023

Choose a reason for hiding this comment

liaoyt Mar 5, 2023

Choose a reason for hiding this comment

pan3793 Mar 4, 2023 • edited Loading

Choose a reason for hiding this comment

codecov-commenter commented Mar 4, 2023 • edited Loading

Codecov Report

pan3793 commented Mar 6, 2023

pan3793 Mar 4, 2023 •

edited

Loading

codecov-commenter commented Mar 4, 2023 •

edited

Loading