Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation #4444

Closed
wants to merge 3 commits into from

Conversation

liaoyt
Copy link
Contributor

@liaoyt liaoyt commented Mar 4, 2023

Why are the changes needed?

GetTables operation is too slow because it queries table details info one by one, but then only a table comment is used to construct a result row, which i think could be optional.
This PR add an optional config which can control this operation. By default, GetTables operation queries all message. Otherwise, GetTables operation just return table identifiers.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before make a pull request

@github-actions github-actions bot added the kind:documentation Documentation is a feature! label Mar 4, 2023
@@ -2713,4 +2713,11 @@ object KyuubiConf {
.version("1.7.0")
.timeConf
.createWithDefault(Duration.ofSeconds(60).toMillis)

val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
buildConf("kyuubi.engine.spark.list.tables")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using kyuubi.operation.getTables.ignoreTableProperties here, because

  1. technically, this optimization can be applied to all engines.
  2. back to the purpose of this PR, it aims to avoid the call of retrieving each table's properties to speed up the GetTables operation, for Spark, only "comment" is taken here, but it's better to make the configuration more generic, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. This can be a generic configuration and kyuubi.operation.getTables.ignoreTableProperties looks more reasonable, but the relational implement maybe different in other engine(Flink, Trino, eg), which can open another issue to follow up.

@@ -166,10 +167,12 @@ class CatalogShim_v3_0 extends CatalogShim_v2_4 {
val identifiers = namespaces.flatMap { ns =>
tc.listTables(ns).filter(i => tp.matcher(quoteIfNeeded(i.name())).matches())
}
val listTablesOnly = spark.conf.getOption(KyuubiConf.ENGINE_SPARK_LIST_TABLES.key)
.map(_.toBoolean).getOrElse(KyuubiConf.ENGINE_SPARK_LIST_TABLES.defaultVal.get)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does not consider the kyuubi session conf, I suggest evaluating the flag in the GetTables, you can refer to SparkOperation#operationSparkListenerEnabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll have a look.


val ENGINE_SPARK_LIST_TABLES: ConfigEntry[Boolean] =
buildConf("kyuubi.engine.spark.list.tables")
.doc("Only query table identifiers when set to true. Work on Spark 3.x only.")
Copy link
Member

@pan3793 pan3793 Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the description to match the configuration name, and we don't need to mention Spark 3.x since Kyuubi only supports Spark 3.1 and above

@codecov-commenter
Copy link

codecov-commenter commented Mar 4, 2023

Codecov Report

Merging #4444 (0c9985e) into master (355ed80) will decrease coverage by 0.05%.
The diff coverage is 88.88%.

❗ Current head 0c9985e differs from pull request most recent head af5e60e. Consider uploading reports for the commit af5e60e to get more accurate results

@@             Coverage Diff              @@
##             master    #4444      +/-   ##
============================================
- Coverage     53.28%   53.23%   -0.05%     
  Complexity       13       13              
============================================
  Files           569      569              
  Lines         31146    31171      +25     
  Branches       4208     4210       +2     
============================================
- Hits          16595    16594       -1     
- Misses        12980    13000      +20     
- Partials       1571     1577       +6     
Impacted Files Coverage Δ
...he/kyuubi/engine/spark/shim/CatalogShim_v3_0.scala 79.34% <75.00%> (-0.66%) ⬇️
...in/scala/org/apache/kyuubi/config/KyuubiConf.scala 97.39% <100.00%> (+<0.01%) ⬆️
...kyuubi/server/trino/api/v1/StatementResource.scala 50.66% <0.00%> (-8.00%) ⬇️
...uubi/engine/spark/events/SparkOperationEvent.scala 88.88% <0.00%> (-5.56%) ⬇️
...rg/apache/kyuubi/server/api/v1/AdminResource.scala 84.55% <0.00%> (-2.84%) ⬇️
...in/spark/authz/ranger/SparkRangerAdminPlugin.scala 64.47% <0.00%> (-2.64%) ⬇️
...g/apache/kyuubi/operation/BatchJobSubmission.scala 75.27% <0.00%> (-2.20%) ⬇️
...n/scala/org/apache/kyuubi/KyuubiSQLException.scala 91.56% <0.00%> (-1.21%) ⬇️
.../apache/kyuubi/client/KyuubiSyncThriftClient.scala 89.49% <0.00%> (-0.85%) ⬇️
...n/scala/org/apache/kyuubi/engine/ProcBuilder.scala 78.39% <0.00%> (-0.62%) ⬇️
... and 2 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@liaoyt liaoyt changed the title [KYUUBI #4171] Supports ignore table comment when list all tables. [WIP][KYUUBI #4171] Supports ignore table comment when list all tables. Mar 5, 2023
@liaoyt liaoyt changed the title [WIP][KYUUBI #4171] Supports ignore table comment when list all tables. [KYUUBI #4171] Supports ignore table comment when list all tables. Mar 6, 2023
@liaoyt liaoyt requested a review from pan3793 March 6, 2023 01:43
@pan3793 pan3793 changed the title [KYUUBI #4171] Supports ignore table comment when list all tables. [KYUUBI #4171] Support skip retrieving table's properties to speed up GetTables operation Mar 6, 2023
@pan3793 pan3793 added this to the v1.8.0 milestone Mar 6, 2023
@pan3793 pan3793 closed this in b40fea2 Mar 6, 2023
@pan3793
Copy link
Member

pan3793 commented Mar 6, 2023

Thanks, merged to master

pan3793 added a commit that referenced this pull request Jan 29, 2024
# 🔍 Description
## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in #4956, #5949, the GetTables operation is quite slow in some cases, and in #4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic step of GetTables in the Spark engine is
```
val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)
```
then filter `tableObjects` with `tableTypes: Set[String]`.

The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand.

For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.)

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
pan3793 added a commit that referenced this pull request Jan 29, 2024
# 🔍 Description
## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in #4956, #5949, the GetTables operation is quite slow in some cases, and in #4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic step of GetTables in the Spark engine is
```
val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)
```
then filter `tableObjects` with `tableTypes: Set[String]`.

The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand.

For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.)

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d474768)
Signed-off-by: Cheng Pan <[email protected]>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this pull request Feb 5, 2024
…atalog

# 🔍 Description
## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in apache#4956, apache#5949, the GetTables operation is quite slow in some cases, and in apache#4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic step of GetTables in the Spark engine is
```
val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)
```
then filter `tableObjects` with `tableTypes: Set[String]`.

The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand.

For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.)

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#6018 from pan3793/fast-get-table.

Closes apache#6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this pull request Mar 21, 2024
…atalog

# 🔍 Description
## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in apache#4956, apache#5949, the GetTables operation is quite slow in some cases, and in apache#4444, `kyuubi.operation.getTables.ignoreTableProperties` was introduced to speed up the V2 catalog, but not covers session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic step of GetTables in the Spark engine is
```
val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)
```
then filter `tableObjects` with `tableTypes: Set[String]`.

The cost of `catalog.getTablesByName(identifiers)` is quite high when the table number is large, e.g. dozen thousand.

For some cases, listing tables only for table name display, it is worth speeding up the operation while ignoring some properties(e.g. table comments) and query criteria(specifically in this case, when `kyuubi.operation.getTables.ignoreTableProperties=true`, criteria `tableTypes` will be ignored, and all tables and views will be treated as TABLE to return.)

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes apache#6018 from pan3793/fast-get-table.

Closes apache#6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants