Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK] Improvement for DecimalType serialization in toTColumn for column-based TRowSet generation #5810

Closed
wants to merge 2 commits into from

Conversation

bowenliang123
Copy link
Contributor

@bowenliang123 bowenliang123 commented Dec 4, 2023

🔍 Description

Issue References 🔗

Subtask of #5808.

Describe Your Solution 🔧

Improvement for DecimalType serialization in toTColumn for column-based TRowSet generation, by skipping type wrapping inside HiveResult.toHiveString.

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Column-based TRowSet generation with 3000 rows:
decimalVal 8.935 ms 335.758 rows/ms

Behavior With This Pull Request 🎉

decimalVal 1.135 ms 2643.172 rows/ms

Related Unit Tests

Spark Engine's RowSetSuite


Checklists

📝 Author Self Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

  • Pull request title is okay.
  • No license issues.
  • Milestone correctly set?
  • Test coverage is ok
  • Assignees are selected.
  • Minimum number of approvals
  • No changes are requested

Be nice. Be informative.

@codecov-commenter
Copy link

codecov-commenter commented Dec 4, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2a39d69) 61.35% compared to head (b7a3a2d) 61.34%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5810      +/-   ##
============================================
- Coverage     61.35%   61.34%   -0.01%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         35931    35945      +14     
  Branches       4937     4942       +5     
============================================
+ Hits          22045    22050       +5     
- Misses        11494    11495       +1     
- Partials       2392     2400       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@pan3793 pan3793 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@bowenliang123 bowenliang123 changed the title Improvement for DecimalType serialization in toTColumn for column-based TRowSet generation [SPARK] Improvement for DecimalType serialization in toTColumn for column-based TRowSet generation Dec 4, 2023
@wForget
Copy link
Member

wForget commented Dec 4, 2023

Is this still necessary after #5811 is fixed?

@bowenliang123
Copy link
Contributor Author

I think yes. This is skipping falling through to RowSet.toHiveString first, while #5811 is targeted for skipping timeFormatter creation inside RowSet.toHiveString.

@wForget
Copy link
Member

wForget commented Dec 4, 2023

I think yes. This is skipping falling through to RowSet.toHiveString first, while #5811 is targeted for skipping timeFormatter creation inside RowSet.toHiveString.

I think this might not be necessary if the RowSet.toHiveString performance issue is fixed since they all just convert decimal to string.

@bowenliang123
Copy link
Contributor Author

The implementation in this PR hints the Java's BigDecimal as the targeted class , and the value is get as BigDecimal directly. And the default value is also directly apply to nulls.

@bowenliang123
Copy link
Contributor Author

Is this still necessary after #5811 is fixed?

Seems that the answer should be NO.
Reusing the timeFormatters for serializing DecimalType gains similar performance improvement. (Screenshot shows reasonable performance even with the DecimalType handling branch commented.)

image

@bowenliang123 bowenliang123 deleted the rowset-decimal branch December 6, 2023 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants