Speed up GetTables operation for Spark session catalog #6018

Closed
wants to merge 3 commits into from


pan3793
Member

@pan3793 pan3793 commented Jan 25, 2024

🔍 Description

Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in #4956 and #5949, the GetTables operation is quite slow in some cases. #4444 introduced kyuubi.operation.getTables.ignoreTableProperties to speed it up for V2 catalogs, but that option does not cover the session catalog.

Describe Your Solution 🔧

Extend the scope of kyuubi.operation.getTables.ignoreTableProperties to cover the GetTables operation for the Spark session catalog.

Currently, the basic steps of GetTables in the Spark engine are:

val catalog: String = getCatalog(spark, catalogName)
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
val identifiers: Seq[TableIdentifier] = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = catalog.getTablesByName(identifiers)

The resulting tableObjects are then filtered by tableTypes: Set[String].

The cost of catalog.getTablesByName(identifiers) is quite high when the number of tables is large, e.g. tens of thousands.

In some cases, tables are listed only to display their names. There it is worth speeding up the operation at the cost of ignoring some properties (e.g. table comments) and query criteria. Specifically, when kyuubi.operation.getTables.ignoreTableProperties=true, the tableTypes criterion is ignored, and all tables and views are returned as TABLE.
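The trade-off described above can be sketched with a small, self-contained model. This is only an illustration, not the code from this patch: `TableId`, `TableMeta`, `TableRow`, `Catalog` and `GetTablesSketch` are hypothetical stand-ins for Spark's `TableIdentifier`, `CatalogTable` and the session catalog, and treating an empty `tableTypes` set as "no filter" is an assumption.

```scala
// Hypothetical stand-ins for Spark's TableIdentifier and CatalogTable.
final case class TableId(database: String, table: String)
final case class TableMeta(id: TableId, tableType: String, comment: Option[String])

// One row of a GetTables result: schema, table name, table type, comment.
final case class TableRow(schema: String, name: String, tableType: String, comment: String)

// Hypothetical catalog: listing identifiers is cheap, loading metadata is expensive.
trait Catalog {
  def listTables(db: String): Seq[TableId]
  def getTablesByName(ids: Seq[TableId]): Seq[TableMeta]
}

object GetTablesSketch {
  def getTables(
      catalog: Catalog,
      db: String,
      tableTypes: Set[String],
      ignoreTableProperties: Boolean): Seq[TableRow] = {
    val ids = catalog.listTables(db)
    if (ignoreTableProperties) {
      // Fast path: skip the metadata lookup entirely. Without metadata the
      // tableTypes filter cannot be applied, so every entry is reported as
      // TABLE with an empty comment.
      ids.map(id => TableRow(id.database, id.table, "TABLE", ""))
    } else {
      // Slow path: load full metadata, then filter by the requested types.
      // Assumption: an empty tableTypes set means "no filter".
      catalog.getTablesByName(ids)
        .filter(t => tableTypes.isEmpty || tableTypes.contains(t.tableType))
        .map(t => TableRow(t.id.database, t.id.table, t.tableType, t.comment.getOrElse("")))
    }
  }
}
```

In this model the fast path never calls `getTablesByName`, which is where the cost grows with the number of tables; the price is that views are indistinguishable from tables in the result.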

Types of changes 🔖

  • [ ] Bugfix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Pass GA


Checklist 📝

Be nice. Be informative.

@codecov-commenter

codecov-commenter commented Jan 25, 2024

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (47a1091) 61.19% compared to head (058001c) 61.11%.
Report is 2 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...e/kyuubi/engine/spark/util/SparkCatalogUtils.scala | 69.56% | 5 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #6018      +/-   ##
============================================
- Coverage     61.19%   61.11%   -0.09%     
  Complexity       23       23              
============================================
  Files           623      623              
  Lines         37060    37103      +43     
  Branches       5024     5029       +5     
============================================
- Hits          22680    22674       -6     
- Misses        11948    11988      +40     
- Partials       2432     2441       +9     


@pan3793 pan3793 requested review from cxzl25 and cfmcgrady January 29, 2024 05:49
@pan3793 pan3793 self-assigned this Jan 29, 2024
@pan3793 pan3793 added this to the v1.8.1 milestone Jan 29, 2024
@pan3793 pan3793 closed this in d474768 Jan 29, 2024
pan3793 added a commit that referenced this pull request Jan 29, 2024
## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d474768)
Signed-off-by: Cheng Pan <[email protected]>
@pan3793
Member Author

pan3793 commented Jan 29, 2024

Merged to master/1.8

zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this pull request Feb 5, 2024
pan3793 added a commit that referenced this pull request Feb 22, 2024
…se notes

# 🔍 Description
## Issue References 🔗

Currently, release notes are written by hand from scratch, which is rather primitive. Some of the mechanical and repetitive work can be handled by scripts.

## Describe Your Solution 🔧

Add a script to simplify the process of creating release notes.

Note: the script only simplifies parts of the process; the release manager still needs to tune the output by hand.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

```
RELEASE_TAG=v1.8.1 PREVIOUS_RELEASE_TAG=v1.8.0 build/release/pre_gen_release_notes.py
```

```
$ head build/release/commits-v1.8.1.txt
[KYUUBI #5981] Deploy Spark Hive connector with Scala 2.13 to Maven Central
[KYUUBI #6058] Make Jetty server stop timeout configurable
[KYUUBI #5952][1.8] Disconnect connections without running operations after engine maxlife time graceful period
[KYUUBI #6048] Assign serviceNode and add volatile for variables
[KYUUBI #5991] Error on reading Atlas properties composed of multi values
[KYUUBI #6045] [REST] Sync the AdminRestApi with the AdminResource Apis
[KYUUBI #6047] [CI] Free up disk space
[KYUUBI #6036] JDBC driver conditional sets fetchSize on opening session
[KYUUBI #6028] Exited spark-submit process should not block batch submit queue
[KYUUBI #6018] Speed up GetTables operation for Spark session catalog
```

```
$ head build/release/contributors-v1.8.1.txt
* Shaoyun Chen        -- [KYUUBI #5857][KYUUBI #5720][KYUUBI #5785][KYUUBI #5617]
* Chao Chen           -- [KYUUBI #5750]
* Flyangz             -- [KYUUBI #5832]
* Pengqi Li           -- [KYUUBI #5713]
* Bowen Liang         -- [KYUUBI #5730][KYUUBI #5802][KYUUBI #5767][KYUUBI #5831][KYUUBI #5801][KYUUBI #5754][KYUUBI #5626][KYUUBI #5811][KYUUBI #5853][KYUUBI #5765]
* Paul Lin            -- [KYUUBI #5799][KYUUBI #5814]
* Senmiao Liu         -- [KYUUBI #5969][KYUUBI #5244]
* Xiao Liu            -- [KYUUBI #5962]
* Peiyue Liu          -- [KYUUBI #5331]
* Junjie Ma           -- [KYUUBI #5789]
```
---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6074 from pan3793/release-script.

Closes #6074

3d5ec20 [Cheng Pan] credits
1765279 [Cheng Pan] Add a script to simplify the process of creating release notes

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this pull request Mar 21, 2024
zhaohehuhu pushed a commit to zhaohehuhu/incubator-kyuubi that referenced this pull request Mar 21, 2024
3 participants