Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Umbrella] Improvements and evaluation for TRowSet generation of Spark Engine #5808

Open
3 of 12 tasks
bowenliang123 opened this issue Dec 3, 2023 · 0 comments
Open
3 of 12 tasks
Labels
kind:umbrella This a umbrella ticket priority:major

Comments

@bowenliang123
Copy link
Contributor

bowenliang123 commented Dec 3, 2023

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the proposal

RowSet generation is that taking the results from result iterator and serializing them into column-based or row-based TRowSet, which is the key point for transportation and performance in most common cases.

  1. It's been reported possibility drawbacks in looping the result iterator by wrapped stream in SparkOperation (https://github.com/apache/kyuubi/blob/master/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/SparkOperation.scala#L253C21-L253C26)
  2. the performance of Spark Engine's RowSet.toTRowSet should be evaluated by benchmarks, for overall performance and for each data type with different mode of Thrift version and arrow based.
  3. Performance Improvements in Spark Engine's RowSet implementation
  4. Code cleanup in Spark Engine's RowSet generation

Task list

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
@bowenliang123 bowenliang123 added kind:umbrella This a umbrella ticket priority:major labels Dec 3, 2023
@bowenliang123 bowenliang123 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 6, 2023
@bowenliang123 bowenliang123 reopened this Dec 6, 2023
pan3793 pushed a commit that referenced this issue Dec 7, 2023
…ization of TRowSet generation

# 🔍 Description
## Issue References 🔗

Subtask of #5808.

## Describe Your Solution 🔧
Value serialization to Hive style string by  `HiveResult.toHiveString`  requires a `TimeFormatters` instance handling the date/time data types. Reusing the pre-created time formatters's instance, it dramatically reduces the overhead and improves the TRowset generation.

This may help to reduce memory footprints and fewer operations for TRowSet generation performance.

This is also aligned to the Spark's `RowSetUtils` in the implementation of RowSet generation for  [SPARK-39041](https://issues.apache.org/jira/browse/SPARK-39041) by yaooqinn , with explicitly declared TimeFormatters instance.

## Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [x] Minimum number of approvals
- [x] No changes are requested

**Be nice. Be informative.**

Closes #5811 from bowenliang123/rowset-timeformatter.

Closes #5811

2270991 [Bowen Liang] Reuse time formatters instance in value serialization of TRowSet

Authored-by: Bowen Liang <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
pan3793 pushed a commit that referenced this issue Dec 7, 2023
…ization of TRowSet generation

# 🔍 Description
## Issue References 🔗

Subtask of #5808.

## Describe Your Solution 🔧
Value serialization to Hive style string by  `HiveResult.toHiveString`  requires a `TimeFormatters` instance handling the date/time data types. Reusing the pre-created time formatters's instance, it dramatically reduces the overhead and improves the TRowset generation.

This may help to reduce memory footprints and fewer operations for TRowSet generation performance.

This is also aligned to the Spark's `RowSetUtils` in the implementation of RowSet generation for  [SPARK-39041](https://issues.apache.org/jira/browse/SPARK-39041) by yaooqinn , with explicitly declared TimeFormatters instance.

## Types of changes 🔖

- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [x] Minimum number of approvals
- [x] No changes are requested

**Be nice. Be informative.**

Closes #5811 from bowenliang123/rowset-timeformatter.

Closes #5811

2270991 [Bowen Liang] Reuse time formatters instance in value serialization of TRowSet

Authored-by: Bowen Liang <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit 4463cc8)
Signed-off-by: Cheng Pan <[email protected]>
bowenliang123 added a commit that referenced this issue Dec 8, 2023
…umnValues in TRowSet generation

# 🔍 Description
## Issue References 🔗

Subtask of #5808.

## Describe Your Solution 🔧

To avoid unnecessary possible array copy in ensuring capacity of the array list, pre-allocate array list capacity for TColumns and TColumnValues in TRowSet generation.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉
No behaviour changes.

#### Related Unit Tests
RowSetSuite of Spark Engine.

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [ ] I have performed a self-review
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [x] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5831 from bowenliang123/rowset-inplace.

Closes #5831

c1c6c0f [liangbowen] avoid possible array copying with growing of array list in TRowSet generation

Lead-authored-by: Bowen Liang <[email protected]>
Co-authored-by: liangbowen <[email protected]>
Signed-off-by: Bowen Liang <[email protected]>
bowenliang123 added a commit that referenced this issue Dec 8, 2023
…umnValues in TRowSet generation

# 🔍 Description
## Issue References 🔗

Subtask of #5808.

## Describe Your Solution 🔧

To avoid unnecessary possible array copy in ensuring capacity of the array list, pre-allocate array list capacity for TColumns and TColumnValues in TRowSet generation.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉
No behaviour changes.

#### Related Unit Tests
RowSetSuite of Spark Engine.

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [ ] I have performed a self-review
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [x] Pull request title is okay.
- [x] No license issues.
- [x] Milestone correctly set?
- [x] Test coverage is ok
- [x] Assignees are selected.
- [x] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5831 from bowenliang123/rowset-inplace.

Closes #5831

c1c6c0f [liangbowen] avoid possible array copying with growing of array list in TRowSet generation

Lead-authored-by: Bowen Liang <[email protected]>
Co-authored-by: liangbowen <[email protected]>
Signed-off-by: Bowen Liang <[email protected]>
(cherry picked from commit 1d6e653)
Signed-off-by: Bowen Liang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:umbrella This a umbrella ticket priority:major
Projects
None yet
Development

No branches or pull requests

1 participant