-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use toIndexedSeq for preparing taken collection of result rowset generation #5804
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5804 +/- ##
============================================
+ Coverage 61.41% 61.45% +0.03%
Complexity 23 23
============================================
Files 608 608
Lines 35931 35961 +30
Branches 4937 4937
============================================
+ Hits 22068 22100 +32
+ Misses 11479 11470 -9
- Partials 2384 2391 +7 ☔ View full report in Codecov by Sentry. |
If the consumer of https://docs.scala-lang.org/overviews/collections/seqs.html
|
Changed to use
|
@@ -250,7 +250,7 @@ abstract class SparkOperation(session: Session) | |||
} else { | |||
val taken = iter.take(rowSetSize) | |||
RowSet.toTRowSet( | |||
taken.toSeq.asInstanceOf[Seq[Row]], | |||
taken.toIndexedSeq.asInstanceOf[Seq[Row]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toIndexedSeq
LGTM
@@ -250,7 +250,7 @@ abstract class SparkOperation(session: Session) | |||
} else { | |||
val taken = iter.take(rowSetSize) | |||
RowSet.toTRowSet( | |||
taken.toSeq.asInstanceOf[Seq[Row]], | |||
taken.toIndexedSeq.asInstanceOf[Seq[Row]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave some comments here to explain why we should use index-access-friendly data structure
5358762
to
52d25c7
Compare
🔍 Description
Issue References 🔗
As described.
Describe Your Solution 🔧
Currently in generation for result RowSet for non-arrow-based operations in
SparkOperation
, it usestoSeq
before assembling the RowSet which indeed usestoStream
that has weak performance in getting elements by index, required insideRowSet.toTRowSet - toColumnBasedSet / toRowBasedSet- toTColumn - getOrSetAsNull
methods.Refering to Scala docs for collection performance (https://docs.scala-lang.org/overviews/collections/performance-characteristics.html), the
immutable.Stream
andimmutable.List
are slow forapply
getting element by index, whilescala.Array
(basically Java's array) andimmutable.Vector
(Scala implemented) are considerably better in this operation with effectively constant time.immutable.Vector
is chosen for this improvement, considering that it guarantees immutable semantics (whileArray
doesn't), it's good for random access and is also more adaptative to Scala collection traits.Types of changes 🔖
Test Plan 🧪
Behavior Without This Pull Request ⚰️
Behavior With This Pull Request 🎉
Related Unit Tests
Checklists
📝 Author Self Checklist
📝 Committer Pre-Merge Checklist
Be nice. Be informative.