[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444
@@ -2713,4 +2713,11 @@ object KyuubiConf {
      .version("1.7.0")
      .timeConf
      .createWithDefault(Duration.ofSeconds(60).toMillis)

  val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
    buildConf("kyuubi.engine.spark.list.tables")
I suggest using kyuubi.operation.getTables.ignoreTableProperties here, because:
- Technically, this optimization can be applied to all engines.
- The purpose of this PR is to avoid retrieving each table's properties in order to speed up the GetTables operation. For Spark, only "comment" is used here, but it's better to make the configuration more generic, right?
Good suggestion. This can be a generic configuration, and kyuubi.operation.getTables.ignoreTableProperties looks more reasonable. However, the corresponding implementation may differ in other engines (e.g. Flink, Trino), so we can open a separate issue to follow up on those.
@@ -166,10 +167,12 @@ class CatalogShim_v3_0 extends CatalogShim_v2_4 {
    val identifiers = namespaces.flatMap { ns =>
      tc.listTables(ns).filter(i => tp.matcher(quoteIfNeeded(i.name())).matches())
    }
    val listTablesOnly = spark.conf.getOption(KyuubiConf.ENGINE_SPARK_LIST_TABLES.key)
      .map(_.toBoolean).getOrElse(KyuubiConf.ENGINE_SPARK_LIST_TABLES.defaultVal.get)
This does not take the Kyuubi session conf into account. I suggest evaluating the flag in GetTables; you can refer to SparkOperation#operationSparkListenerEnabled.
Thanks, I'll have a look.
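To illustrate the reviewer's point about honoring the session conf, here is a minimal sketch of one possible lookup order: session-level value first, then engine-level value, then the built-in default. The object `FlagResolution`, the two plain `Map` arguments, and the exact precedence are assumptions made for illustration; this is not the code in `SparkOperation` or `GetTables`. The default of `false` follows the PR description, which says the full table information is queried by default.

```scala
// Hypothetical, simplified stand-in for flag resolution; not Kyuubi's actual code.
object FlagResolution {
  val Key = "kyuubi.operation.getTables.ignoreTableProperties"

  // Assumed default: properties are retrieved unless the flag is switched on.
  val Default = false

  /** A session-level value wins, then an engine-level value, then the default. */
  def ignoreTableProperties(
      sessionConf: Map[String, String],
      engineConf: Map[String, String]): Boolean =
    sessionConf.get(Key)
      .orElse(engineConf.get(Key))
      .map(_.trim.toBoolean) // throws IllegalArgumentException on non-boolean text
      .getOrElse(Default)
}
```

Reading only the engine-level conf, as in the hunk above, would skip the first lookup, so a user who sets the flag for a single session would see no effect.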
  val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
    buildConf("kyuubi.engine.spark.list.tables")
      .doc("Only query table identifiers when set to true. Work on Spark 3.x only.")
Please update the description to match the configuration name. We also don't need to mention Spark 3.x, since Kyuubi only supports Spark 3.1 and above.
Codecov Report
@@ Coverage Diff @@
## master #4444 +/- ##
============================================
- Coverage 53.28% 53.23% -0.05%
Complexity 13 13
============================================
Files 569 569
Lines 31146 31171 +25
Branches 4208 4210 +2
============================================
- Hits 16595 16594 -1
- Misses 12980 13000 +20
- Partials 1571 1577 +6
Thanks, merged to master.
# 🔍 Description

## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.

As reported in #4956 and #5949, the GetTables operation is quite slow in some cases. In #4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but it does not cover the session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic steps of GetTables in the Spark engine are

```
val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)
```

followed by filtering `tableObjects` with `tableTypes: Set[String]`.

The cost of `catalog.getTablesByName(identifiers)` is quite high when the number of tables is large, e.g. tens of thousands.

In some cases, such as listing tables only to display their names, it is worth speeding up the operation at the cost of ignoring some properties (e.g. table comments) and query criteria. Specifically, when `kyuubi.operation.getTables.ignoreTableProperties=true`, the `tableTypes` criterion is ignored, and all tables and views are returned as TABLE.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
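The trade-off described in this commit message can be sketched with an in-memory model. Everything below is hypothetical and for illustration only: `Ident`, `TableMeta`, `Row`, and the `Catalog` trait are simplified stand-ins for Spark's `TableIdentifier`, `CatalogTable`, and session catalog, and treating an empty `tableTypes` set as "no filter" is an assumption. The sketch only shows the shape of the two paths: the fast path never loads per-table metadata, so it cannot apply `tableTypes` and reports every entry as TABLE.

```scala
object GetTablesSketch {
  // Hypothetical stand-ins for Spark's TableIdentifier and CatalogTable.
  final case class Ident(db: String, name: String)
  final case class TableMeta(ident: Ident, tableType: String, comment: String)
  final case class Row(db: String, name: String, tableType: String, comment: String)

  // Hypothetical catalog: listing names is cheap, loading metadata is expensive.
  trait Catalog {
    def listTables(db: String): Seq[Ident]
    def getTablesByName(idents: Seq[Ident]): Seq[TableMeta]
  }

  def getTables(
      catalog: Catalog,
      db: String,
      tableTypes: Set[String],
      ignoreTableProperties: Boolean): Seq[Row] = {
    val idents = catalog.listTables(db)
    if (ignoreTableProperties) {
      // Fast path: no metadata lookup, so tableTypes cannot be applied and
      // every entry is returned as TABLE with an empty comment.
      idents.map(i => Row(i.db, i.name, "TABLE", ""))
    } else {
      // Full path: load metadata, then filter by the requested table types.
      // Assumption: an empty tableTypes set means "no filter".
      catalog.getTablesByName(idents)
        .filter(t => tableTypes.isEmpty || tableTypes.contains(t.tableType))
        .map(t => Row(t.ident.db, t.ident.name, t.tableType, t.comment))
    }
  }
}
```

With the flag enabled, a client that asks only for views still receives every table and view labeled as TABLE, which is the behavior change callers need to be aware of.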
Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d474768)
Signed-off-by: Cheng Pan <[email protected]>
Why are the changes needed?

The GetTables operation is too slow because it queries each table's details one by one, yet only the table comment is used to construct a result row, which I think could be optional.

This PR adds an optional config to control this behavior. By default, the GetTables operation retrieves the full table information. When the option is enabled, the GetTables operation returns only the table identifiers.

How was this patch tested?

- Add some test cases that check the changes thoroughly, including negative and positive cases if possible
- Add screenshots for manual tests if appropriate
- Run tests locally before making a pull request