Use toIndexedSeq for preparing taken collection of result rowset generation #5804

bowenliang123 · 2023-12-01T19:44:09Z

🔍 Description

Issue References 🔗

As described.

Describe Your Solution 🔧

Currently, when generating the result RowSet for non-Arrow-based operations in SparkOperation, the taken rows are converted with toSeq before the RowSet is assembled. On an iterator, toSeq is in fact backed by toStream, which performs poorly when elements are fetched by index. Index access is required inside RowSet.toTRowSet, through the toColumnBasedSet / toRowBasedSet, toTColumn and getOrSetAsNull methods.
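To make the access pattern concrete, here is a minimal sketch of a row-to-column conversion that reads every row by index. It is not Kyuubi's actual RowSet code: `SimpleRow`, `toColumn` and `toColumns` are hypothetical names used only for illustration, and a row is modelled as a plain sequence of values.

```scala
object ColumnTransposeSketch {
  // Hypothetical stand-in for a row: a sequence of (possibly null) values.
  type SimpleRow = IndexedSeq[Any]

  // Builds one column by reading rows(i)(ordinal) for every row index i.
  // On a linear Seq (List, Stream) each rows(i) walks i nodes, so the loop
  // is quadratic in the row count; on an indexed Seq it stays linear.
  def toColumn(rows: Seq[SimpleRow], ordinal: Int): Vector[Any] = {
    val size = rows.length
    val builder = Vector.newBuilder[Any]
    var i = 0
    while (i < size) {
      builder += rows(i)(ordinal)
      i += 1
    }
    builder.result()
  }

  // Repeats the index-based scan once per column.
  def toColumns(rows: Seq[SimpleRow], columnCount: Int): Vector[Vector[Any]] =
    (0 until columnCount).map(toColumn(rows, _)).toVector

  def main(args: Array[String]): Unit = {
    val rows: IndexedSeq[SimpleRow] = Iterator(
      Vector[Any](1, "a"),
      Vector[Any](2, null),
      Vector[Any](3, "c")).toIndexedSeq
    println(toColumns(rows, 2))
  }
}
```

Because the scan is repeated for each column, the cost of a slow `apply` is multiplied by the column count.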

According to the Scala docs on collection performance (https://docs.scala-lang.org/overviews/collections/performance-characteristics.html), immutable.Stream and immutable.List are slow at apply (fetching an element by index), taking time linear in the index. scala.Array (basically Java's array) and immutable.Vector (implemented in Scala) are considerably better at this operation, with constant and effectively constant time respectively.

immutable.Vector is chosen for this improvement because it guarantees immutable semantics (which Array does not), is good for random access, and fits better with the Scala collection traits.
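The difference can be observed with a rough timing sketch. This is an illustration under stated assumptions, not a benchmark: `sumByIndex` and `timeMillis` are hypothetical helpers, List stands in for the linear case (Stream is deprecated in Scala 2.13), and the measured times depend on the JVM and its warm-up.

```scala
object IndexedAccessSketch {
  // Sums elements by position, the access pattern that favors indexed sequences.
  def sumByIndex(xs: Seq[Int]): Long = {
    val size = xs.length
    var acc = 0L
    var i = 0
    while (i < size) {
      acc += xs(i) // O(i) on a List, effectively constant on a Vector
      i += 1
    }
    acc
  }

  // Runs the body once and returns its result with the elapsed milliseconds.
  def timeMillis[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 20000
    val linear = Iterator.range(0, n).toList        // linear sequence
    val indexed = Iterator.range(0, n).toIndexedSeq // Vector-backed
    val (linearSum, linearMs) = timeMillis(sumByIndex(linear))
    val (indexedSum, indexedMs) = timeMillis(sumByIndex(indexed))
    assert(linearSum == indexedSum) // same result, different cost
    println(s"List: $linearMs ms, IndexedSeq: $indexedMs ms")
  }
}
```

Both calls compute the same sum. Only the cost of each `xs(i)` lookup differs.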

Types of changes 🔖

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

Checklists

📝 Author Self Checklist

My code follows the style guidelines of this project
I have performed a self-review
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

Be nice. Be informative.

codecov-commenter · 2023-12-01T21:11:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (13af6ae) 61.41% compared to head (0944e43) 61.45%.
Report is 9 commits behind head on master.

❗ Current head 0944e43 differs from pull request most recent head 5358762. Consider uploading reports for the commit 5358762 to get more accurate results

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #5804      +/-   ##
============================================
+ Coverage     61.41%   61.45%   +0.03%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         35931    35961      +30     
  Branches       4937     4937              
============================================
+ Hits          22068    22100      +32     
+ Misses        11479    11470       -9     
- Partials       2384     2391       +7


pan3793 · 2023-12-04T03:02:38Z

If the consumer of TRowSet only uses indexed access, I suppose toIndexedSeq should be used.

https://docs.scala-lang.org/overviews/collections/seqs.html

Trait Seq has two subtraits LinearSeq, and IndexedSeq. These do not add any new operations, but each offers different performance characteristics: A linear sequence has efficient head and tail operations, whereas an indexed sequence has efficient apply, length, and (if mutable) update operations. Frequently used linear sequences are scala.collection.immutable.List and scala.collection.immutable.Stream. Frequently used indexed sequences are scala.Array and scala.collection.mutable.ArrayBuffer. The Vector class provides an interesting compromise between indexed and linear access. It has both effectively constant time indexing overhead and constant time linear access overhead. Because of this, vectors are a good foundation for mixed access patterns where both indexed and linear accesses are used. You’ll learn more on vectors later.
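Since a value declared as `Seq` does not reveal whether it is linear or indexed, a consumer that relies on `apply` could normalize its input first. The following is a sketch only: `ensureIndexed` is a hypothetical helper, not part of Kyuubi or the Scala library.

```scala
object IndexFriendly {
  // Returns the input unchanged if it already supports efficient apply,
  // otherwise copies it once into an immutable indexed sequence.
  def ensureIndexed[A](xs: Seq[A]): IndexedSeq[A] = xs match {
    case indexed: IndexedSeq[A] => indexed
    case linear                 => linear.toIndexedSeq
  }

  def main(args: Array[String]): Unit = {
    val fromList = ensureIndexed(List(1, 2, 3))     // copied into an indexed sequence
    val fromVector = ensureIndexed(Vector(1, 2, 3)) // returned as-is
    println(fromList.getClass.getSimpleName)
    println(fromVector.getClass.getSimpleName)
  }
}
```

The one-time copy is linear in the number of elements, which is cheaper than paying a linear cost on every indexed lookup.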

bowenliang123 · 2023-12-04T15:05:55Z

Changed to use toIndexedSeq, which also produces a Vector instance, the same as toVector.

val iterator = Iterator(1, 2, 3, 4, 5)
val indexedSeq = iterator.toIndexedSeq

println(indexedSeq) // output: Vector(1, 2, 3, 4, 5)

cxzl25 · 2023-12-05T10:09:55Z

...park-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/SparkOperation.scala

@@ -250,7 +250,7 @@ abstract class SparkOperation(session: Session)
          } else {
            val taken = iter.take(rowSetSize)
            RowSet.toTRowSet(
-              taken.toSeq.asInstanceOf[Seq[Row]],
+              taken.toIndexedSeq.asInstanceOf[Seq[Row]],


toIndexedSeq LGTM

https://github.com/scala/scala/blob/aa29f37d7d1182061da5e689aee3eea7a9754f06/src/library/scala/collection/LinearSeqOptimized.scala#L61-L69

https://github.com/scala/scala/blob/aa29f37d7d1182061da5e689aee3eea7a9754f06/src/library/scala/collection/immutable/Vector.scala#L121-L124

pan3793 · 2023-12-05T10:27:50Z

...park-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/SparkOperation.scala

@@ -250,7 +250,7 @@ abstract class SparkOperation(session: Session)
          } else {
            val taken = iter.take(rowSetSize)
            RowSet.toTRowSet(
-              taken.toSeq.asInstanceOf[Seq[Row]],
+              taken.toIndexedSeq.asInstanceOf[Seq[Row]],


Please leave some comments here to explain why we should use an index-access-friendly data structure.

github-actions bot added the module:spark label Dec 1, 2023

bowenliang123 marked this pull request as draft December 1, 2023 19:47

bowenliang123 mentioned this pull request Dec 3, 2023

[Umbrella] Improvements and evaluation for TRowSet generation of Spark Engine #5808

Open

12 tasks

cxzl25 reviewed Dec 5, 2023


pan3793 reviewed Dec 5, 2023


bowenliang123 closed this Dec 5, 2023

bowenliang123 changed the title ~~Use vector for generate result rowset~~ Use toIndexedSeq for preparing taken collection of result rowset generation Dec 6, 2023

bowenliang123 reopened this Dec 6, 2023

bowenliang123 closed this Dec 6, 2023

bowenliang123 force-pushed the rowset-to branch from 5358762 to 52d25c7 Compare December 6, 2023 07:02
