Use toIndexedSeq for preparing taken collection of result rowset generation #5804

Closed

Conversation

bowenliang123
Contributor

@bowenliang123 bowenliang123 commented Dec 1, 2023

🔍 Description

Issue References 🔗

As described.

Describe Your Solution 🔧

Currently, when generating the result RowSet for non-Arrow-based operations in SparkOperation, the taken rows are converted with toSeq before the RowSet is assembled. That call in fact uses toStream, which performs poorly when getting elements by index, as required inside the RowSet.toTRowSet - toColumnBasedSet / toRowBasedSet - toTColumn - getOrSetAsNull methods.
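The access pattern described above can be sketched as follows. This is only an illustration under stated assumptions: `SimpleRow`, `toColumn`, and `toColumns` are hypothetical stand-ins, not Spark's Row or Kyuubi's RowSet code. It shows why a column-oriented conversion ends up calling `rows(i)` once per row for every column.

```scala
object ColumnBasedSketch {
  // Hypothetical stand-in for a row: just an indexed sequence of cell values.
  type SimpleRow = IndexedSeq[Any]

  // Builds one column by reading cell `ordinal` from every row by position.
  // rows(i) is cheap on an IndexedSeq, but walks i elements on a linear Seq.
  def toColumn(rows: Seq[SimpleRow], ordinal: Int): IndexedSeq[Any] = {
    val size = rows.length
    val builder = Vector.newBuilder[Any]
    var i = 0
    while (i < size) {
      builder += rows(i)(ordinal)
      i += 1
    }
    builder.result()
  }

  // Transposes the rows into columns: one toColumn pass per column.
  def toColumns(rows: Seq[SimpleRow], columnCount: Int): IndexedSeq[IndexedSeq[Any]] =
    (0 until columnCount).map(ordinal => toColumn(rows, ordinal))
}
```

With a linear sequence, each `rows(i)` costs time proportional to `i`, so one column takes roughly quadratic time in the number of rows; with an indexed sequence each lookup is effectively constant.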

Referring to the Scala docs on collection performance (https://docs.scala-lang.org/overviews/collections/performance-characteristics.html), immutable.Stream and immutable.List are slow for apply (getting an element by index), taking time linear in the index, while scala.Array (basically Java's array) and immutable.Vector (implemented in Scala) are considerably better at this operation, with constant and effectively constant time respectively.

immutable.Vector is chosen for this improvement because it guarantees immutable semantics (which Array does not), is good for random access, and fits Scala's collection traits more naturally.
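A minimal sketch of the trade-off between the two candidates, using only standard library calls; the object and value names are made up for illustration.

```scala
object VectorVsArraySketch {
  val array: Array[Int] = Array(1, 2, 3)
  val vector: Vector[Int] = Vector(1, 2, 3)

  // Arrays can be modified in place, so every holder of the reference sees the change.
  array(0) = 42

  // Vector has no in-place update: `updated` returns a new Vector
  // and leaves the original untouched.
  val changed: Vector[Int] = vector.updated(0, 42)

  // A Vector already is a Seq, so it can be passed wherever a Seq is expected.
  val asSeq: Seq[Int] = vector

  // Both support access by index.
  val lastOfArray: Int = array(2)
  val lastOfVector: Int = vector(2)
}
```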

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests


Checklists

📝 Author Self Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

  • Pull request title is okay.
  • No license issues.
  • Milestone correctly set?
  • Test coverage is ok
  • Assignees are selected.
  • Minimum number of approvals
  • No changes are requested

Be nice. Be informative.

@bowenliang123 bowenliang123 marked this pull request as draft December 1, 2023 19:47
@codecov-commenter

codecov-commenter commented Dec 1, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (13af6ae) 61.41% compared to head (0944e43) 61.45%.
Report is 9 commits behind head on master.

❗ Current head 0944e43 differs from pull request most recent head 5358762. Consider uploading reports for the commit 5358762 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #5804      +/-   ##
============================================
+ Coverage     61.41%   61.45%   +0.03%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         35931    35961      +30     
  Branches       4937     4937              
============================================
+ Hits          22068    22100      +32     
+ Misses        11479    11470       -9     
- Partials       2384     2391       +7     


@pan3793
Member

pan3793 commented Dec 4, 2023

If the consumer of TRowSet only uses indexed access, I suppose toIndexedSeq should be used.

https://docs.scala-lang.org/overviews/collections/seqs.html

Trait Seq has two subtraits LinearSeq, and IndexedSeq. These do not add any new operations, but each offers different performance characteristics: A linear sequence has efficient head and tail operations, whereas an indexed sequence has efficient apply, length, and (if mutable) update operations. Frequently used linear sequences are scala.collection.immutable.List and scala.collection.immutable.Stream. Frequently used indexed sequences are scala.Array and scala.collection.mutable.ArrayBuffer. The Vector class provides an interesting compromise between indexed and linear access. It has both effectively constant time indexing overhead and constant time linear access overhead. Because of this, vectors are a good foundation for mixed access patterns where both indexed and linear accesses are used. You’ll learn more on vectors later.
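As a rough illustration of the suggestion, a consumer that only uses positional access can state that in its signature. `everyOther` below is a hypothetical helper, not part of Kyuubi or the Scala library.

```scala
object IndexedAccessSketch {
  // Asking for IndexedSeq in the signature documents that the body
  // relies on efficient apply(i) and length.
  def everyOther[A](rows: IndexedSeq[A]): IndexedSeq[A] =
    (0 until rows.length by 2).map(i => rows(i))

  // A caller holding an Iterator materializes it once with toIndexedSeq.
  val sampled: IndexedSeq[Int] =
    everyOther(Iterator(10, 20, 30, 40, 50).toIndexedSeq)
}
```

Passing a List to `everyOther` would not compile, since List is a LinearSeq rather than an IndexedSeq, so the type also guards against accidentally handing over a structure with slow indexing.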

@bowenliang123
Contributor Author

bowenliang123 commented Dec 4, 2023

Changed to use toIndexedSeq, which also produces a Vector instance, the same as toVector.

val iterator = Iterator(1, 2, 3, 4, 5)
val indexedSeq = iterator.toIndexedSeq

println(indexedSeq) // output: Vector(1, 2, 3, 4, 5)

@@ -250,7 +250,7 @@ abstract class SparkOperation(session: Session)
     } else {
       val taken = iter.take(rowSetSize)
       RowSet.toTRowSet(
-        taken.toSeq.asInstanceOf[Seq[Row]],
+        taken.toIndexedSeq.asInstanceOf[Seq[Row]],
@pan3793 pan3793 Dec 5, 2023

Please leave some comments here to explain why we should use an index-access-friendly data structure.
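One possible shape for such a comment, sketched outside the real SparkOperation code. `takeRows` and its parameters are hypothetical, and the wording of the comment is only a suggestion.

```scala
object TakenRowsSketch {
  // Hypothetical helper, not Kyuubi code: materializes up to rowSetSize elements.
  def takeRows[A](iter: Iterator[A], rowSetSize: Int): IndexedSeq[A] = {
    val taken = iter.take(rowSetSize)
    // Materialize into an IndexedSeq (a Vector in practice) rather than calling toSeq:
    // the downstream conversion reads rows by position, and apply(i) on a
    // linear sequence has to walk i elements on every call.
    taken.toIndexedSeq
  }
}
```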

@bowenliang123 bowenliang123 changed the title Use vector for generate result rowset Use toIndexedSeq for preparing taken collection of result rowset generation Dec 6, 2023
@bowenliang123 bowenliang123 reopened this Dec 6, 2023