
[SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. #18655

Closed
wants to merge 17 commits

Conversation

@ueshin ueshin (Member) commented Jul 17, 2017

What changes were proposed in this pull request?

This is a refactoring of ArrowConverters and related classes.

  1. Refactor ColumnWriter as ArrowWriter.
  2. Add ArrayType and StructType support.
  3. Refactor ArrowConverters to skip intermediate ArrowRecordBatch creation.

How was this patch tested?

Added new tests and ran the existing tests.

@ueshin ueshin (Member, Author) commented Jul 17, 2017

cc @cloud-fan @BryanCutler

assert(dictionary == null);
boolean[] array = new boolean[count];
for (int i = 0; i < count; ++i) {
  array[i] = (boolData.getAccessor().get(rowId + i) == 1);
}

@kiszk kiszk (Member) commented Jul 17, 2017

Can we move boolData.getAccessor() out of the loop if it is a loop invariant? Or, can we use nulls?
Ditto for other types (e.g. getBytes()).
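
A minimal sketch of the hoist being suggested (illustrative only; the class under review is Java, and boolData, rowId, and count are assumed from the snippet above):

// Fetch the accessor once instead of on every iteration; it is loop invariant.
val accessor = boolData.getAccessor()
val array = new Array[Boolean](count)
var i = 0
while (i < count) {
  array(i) = accessor.get(rowId + i) == 1
  i += 1
}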


@Override
public boolean getBoolean(int rowId) {
  return boolData.getAccessor().get(rowId) == 1;
}

Member:

Can we use nulls? If so, it would be better to use another name instead of nulls.

@SparkQA commented Jul 17, 2017

Test build #79668 has finished for PR 18655 at commit 58cd465.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk (Member) commented Jul 17, 2017

Good feature, but can we split this PR into smaller PRs for ease of review since it looks large?
For example, since ArrowColumnVector is not used in the refactored code, this part can be moved to another PR.

@BryanCutler (Member):

Thanks for this @ueshin. I agree with @kiszk that it would be easier to review if you can split this into smaller PRs, maybe keep the additional type support separate? I'm all for refactoring this too, but could you elaborate with some details on why you are refactoring ColumnWriter and ArrowConverters? Thanks!

@cloud-fan (Contributor):

Yeah, let's put ArrowColumnVector and its tests in a new PR and merge that first.

ArrowWriter will also be used for pandas UDF, see https://issues.apache.org/jira/browse/SPARK-21190 for more details, so it makes sense to move it to a separate file.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

Thank you for your comments.
I agree that we should split this into smaller PRs. I'll push another commit to remove ArrowColumnVector from this as soon as possible.

@SparkQA commented Jul 18, 2017

Test build #79696 has finished for PR 18655 at commit 8ffedda.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

Jenkins, retest this please.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

@BryanCutler I'd like to share the motivation for refactoring ArrowConverters and ColumnWriter.

For ColumnWriter, I first wanted to support complex types like ArrayType and StructType, so I refactored it based on your ColumnWriter implementation. I then renamed it and moved the package so that we can also use it for pandas UDF, as @cloud-fan mentioned. As you might have seen before, I'll also introduce ArrowColumnVector as a reader for Arrow vectors.

For ArrowConverters, I thought we could skip the intermediate ArrowRecordBatch creation in ArrowConverters.toPayloadIterator(). What do you think about that?

Thanks!
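
To illustrate the proposal, here is a minimal sketch of writing rows straight into the VectorSchemaRoot and letting ArrowFileWriter serialize from that same root, with no intermediate ArrowRecordBatch. The ArrowWriter API (create/write/finish) and the local names (allocator, arrowSchema, rows) are assumptions based on this PR, not a settled interface:

import java.io.ByteArrayOutputStream
import java.nio.channels.Channels

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.file.ArrowFileWriter

// Sketch only: rows is an Iterator[InternalRow]; allocator and arrowSchema are given.
val root = VectorSchemaRoot.create(arrowSchema, allocator)
val arrowWriter = ArrowWriter.create(root)            // wraps the root's field vectors
val out = new ByteArrayOutputStream()
val fileWriter = new ArrowFileWriter(root, null, Channels.newChannel(out))

while (rows.hasNext) {
  arrowWriter.write(rows.next())                      // append one row into the vectors
}
arrowWriter.finish()                                   // set the row count on the root
fileWriter.writeBatch()                                // serialize directly from the root
fileWriter.end()
val payloadBytes = out.toByteArray                     // no ArrowRecordBatch built in between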

@SparkQA commented Jul 18, 2017

Test build #79698 has finished for PR 18655 at commit 8ffedda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member):

> For ArrowConverters, I thought we could skip the intermediate ArrowRecordBatch creation in ArrowConverters.toPayloadIterator(). What do you think about that?

Ok, I see. By using ArrowWriter directly on the root, the ArrowFileWriter can use that same root when creating the byte array, so there is no need to create an intermediate ArrowRecordBatch. That sounds good to me!

> For ColumnWriter, I first wanted to support complex types like ArrayType and StructType, so I refactored it based on your ColumnWriter implementation. I then renamed it and moved the package

That's fine, but do they need to be in o.a.s.sql.execution.vectorized? If so, then what's the point of having a o.a.s.sql.execution.arrow package if ArrowUtils and ArrowWriter are not even there?

@BryanCutler BryanCutler (Member) left a review comment:

Thanks for the PR @ueshin! I have some concerns about using the TaskContext in the iterator to release resources. I believe there was a bug fix in Arrow 0.4.1 for the decimal type, and I had planned to look into upgrading to support that type. Also, with the added type support here, how does it affect the Python side, and are you planning on adding tests there?

As for the refactoring, if it offers a performance improvement, that is great. However, it seems a little out of order to me to refactor and move files around to support a SPIP that has not reached consensus and has not been voted on. Just my thoughts.

recordsInBatch += 1
context.addTaskCompletionListener { _ =>
  root.close()
  allocator.close()
}

Member:

It seems a little odd to me to tie an iterator to a TaskContext; why not just close resources as soon as the row iterator is consumed?

Member (Author):

I was worried about a memory leak when an exception happens during iteration. In that case, the task will fail before the row iterator is completely consumed.

Member:

Yeah, good point. What about closing resources in both ways? Have the listener close them in case something fails; otherwise, close immediately once the row iterator is fully consumed. I'm not really sure at what exact point the task completion listener callback runs; is it dependent on any IO?
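
A sketch of that "both ways" approach, with illustrative names (root, allocator, rowIter, and context are assumed from the surrounding toPayloadIterator code); the closed flag keeps the two paths from closing the same resources twice:

// Sketch only, not the actual implementation.
var closed = false
def closeAll(): Unit = {
  if (!closed) {
    root.close()
    allocator.close()
    closed = true
  }
}

// Safety net: runs when the task completes, including after a failure mid-iteration.
context.addTaskCompletionListener { _ => closeAll() }

// Eager path inside the returned iterator: release the Arrow buffers as soon as
// the underlying row iterator is exhausted.
def hasNext: Boolean = rowIter.hasNext || { closeAll(); false }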

val writer = new ArrowFileWriter(root, null, Channels.newChannel(out))

Utils.tryWithSafeFinally {
  var rowId = 0

Member:

nit: maybe rowCount instead of rowId, because it is a count of how many rows are in the batch so far, not a unique id?

Member (Author):

Thanks! I'll update it.


def setNull(): Unit
def setValue(input: SpecializedGetters, ordinal: Int): Unit
def skip(): Unit

Member:

What's the purpose of the skip() method?

Member (Author):

This is for the case where the value of the struct type is null.
I believe that even if the value of the struct type is null, the fields should still have some values for the same row.
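
For concreteness, a hypothetical StructWriter sketched against the ArrowFieldWriter interface quoted above (it also assumes a write(input, ordinal) method alongside the writeSkip() shown later in this review); when the struct value is null, each child writer still consumes one slot so later rows land at the right indices:

// Sketch only: children holds one writer per struct field.
class StructWriter(children: Array[ArrowFieldWriter]) extends ArrowFieldWriter {
  override def setNull(): Unit = {
    // The struct's own slot stays unset (i.e. null), but every child must still
    // advance by one slot for this row.
    children.foreach(_.writeSkip())
  }

  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
    val struct = input.getStruct(ordinal, children.length)
    var i = 0
    while (i < children.length) {
      children(i).write(struct, i)   // write field i of this row's struct value
      i += 1
    }
  }

  override def skip(): Unit = children.foreach(_.writeSkip())
}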

@@ -391,6 +392,85 @@ class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll {
collectAndValidate(df, json, "floating_point-double_precision.json")
}

ignore("decimal conversion") {

Member:

Why ignore this?

Member (Author):

Oh, I'm sorry, I should have mentioned it.
It seems like JsonFileReader doesn't support DecimalType, so I ignored it for now.
But now I'm thinking that, if Arrow 0.4.0 has a bug for the decimal type as you said, should I remove decimal type support from this PR and add support in following PRs?

Member:

That might be true; I haven't looked into it yet. I can work on adding support on the Arrow side, so I'll try to check on that and see where it stands in the upcoming 0.5 release.

Member:

Arrow integration support for DecimalType isn't slated until v0.6, so it might work, but there are no guarantees that a record batch written in Java will be read back equal by Python/C++. Also, we can't test it here until the JsonFileReader supports it. I filed the Arrow JIRA here: https://issues.apache.org/jira/browse/ARROW-1238

@ueshin ueshin (Member, Author) commented Jul 19, 2017

I see, I'll move the files back to the arrow package.

@SparkQA commented Jul 19, 2017

Test build #79737 has finished for PR 18655 at commit b5988f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 19, 2017

Test build #79741 has finished for PR 18655 at commit a50a271.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 19, 2017

Jenkins, retest this please.

@SparkQA commented Jul 19, 2017

Test build #79744 has finished for PR 18655 at commit a50a271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2017

Test build #79786 has finished for PR 18655 at commit 7084b38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2017

Test build #79798 has finished for PR 18655 at commit 6fc4da0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ArrowWriterSuite extends SparkFunSuite

@wesm wesm (Member) commented Jul 25, 2017

On DecimalType, I want to point out that we haven't hardened the memory format and integration tests between Java<->C++ within Arrow. It would be great if you could help with this -- we ran into a problem in C++ where we needed an extra sign bit with 16-byte high precision decimals. So we have 3 memory representations:

  • 4 byte decimals (low precision)
  • 8 byte
  • 16 byte plus sign bitmap

What is Spark's internal memory representation? cc @cpcloud

@cpcloud commented Jul 25, 2017

Based on this code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala#L429-L547

It looks like there are two types:

  1. 8 byte compact decimal that fits in a Java long
  2. Up to length 16 Array[Byte] representation (based on the BINARY type).

I think Arrow's Decimal representation in Java is almost identical to this.

Looking at the BigInteger Java implementation (which is what BigDecimal sits on top of) the sign is carried around in the first byte of the array.
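
For reference, a small sketch of the two forms described above using Spark's Decimal type; the 18-digit cutoff is Decimal.MAX_LONG_DIGITS, and the actual serialization lives in the ColumnType.scala code linked above:

import org.apache.spark.sql.types.Decimal

// Sketch only: mirrors the two representations, not the actual Spark code.
def decimalBytes(d: Decimal, precision: Int): Array[Byte] = {
  if (precision <= Decimal.MAX_LONG_DIGITS) {
    // Compact form: the unscaled value fits in a Java long (8 bytes).
    java.nio.ByteBuffer.allocate(8).putLong(d.toUnscaledLong).array()
  } else {
    // Large form: two's-complement bytes of the unscaled BigInteger, with the sign
    // carried in the leading byte; up to 16 bytes for precision <= 38.
    d.toJavaBigDecimal.unscaledValue().toByteArray()
  }
}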

@wesm wesm (Member) commented Jul 26, 2017

There are a bunch of open JIRAs about decimals in Arrow: https://issues.apache.org/jira/issues/?filter=12334829&jql=project%20%3D%20ARROW%20AND%20status%20in%20(%22In%20Review%22%2C%20Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22decimal%22
Between these JIRAs and the mailing list, it would be good to come up with a game plan for integration tests between Java and C++ (and thus Python) so we can enable Spark to send decimals to Python.

@ueshin ueshin (Member, Author) commented Jul 26, 2017

@BryanCutler @wesm @cpcloud Thank you for reviewing this.
If the remaining issue here is only DecimalType support, I'd like to separate it from this PR and merge this first to avoid duplicating efforts around writers.
What do you think?

@cloud-fan (Contributor):

Yes, let's leave decimal support for follow-ups.

columnWriters(i).write(row)
i += 1
context.addTaskCompletionListener { _ =>
  if (!closed) {

Contributor:

do we really need this? I think it's ok to close twice?

Member (Author):

The allocator can be closed twice, but the root throws an exception after the allocator is closed.

Contributor:

Is this a bug in Arrow? cc @BryanCutler

Member:

The root just releases the buffers from the FieldVectors, so I would think it should be able to handle being closed twice. I'll check tomorrow if that seems reasonable.

Member:

I filed https://issues.apache.org/jira/browse/ARROW-1283 to fix this. For now, it looks like we need this.

@ueshin ueshin changed the title [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add DecimalType, ArrayType and StructType support. [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. Jul 26, 2017
@BryanCutler (Member):

+1 on holding off for DecimalType support


def writeSkip(): Unit = {
  skip()
  count += 1
}

Contributor:

For skipping purposes, is it enough to just do count += 1? E.g. vector.set(1, v1); vector.set(3, v3); value 2 is skipped.

Member (Author):

Basically, yes, it's enough except for StructType, but should we set the null bit to 1 for a skipped value?

val reader = new ArrowColumnVector(writer.root.getFieldVectors().get(0))
data.zipWithIndex.foreach {
  case (null, rowId) => assert(reader.isNullAt(rowId))
  case (datum, rowId) => assert(get(reader, rowId) === datum)
}

Contributor:

we can do something like

dt match {
 case BooleanType => reader.getBoolean(rowid)
 case IntegerType => ...
 ...
}

Then the caller side doesn't need to pass in a get function.
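
A slightly fuller sketch of that helper (only a handful of types shown), assuming the usual ColumnVector-style getters on ArrowColumnVector:

import org.apache.spark.sql.types._

// Sketch only: dispatch on the DataType so the test doesn't need a caller-supplied get.
def getValue(reader: ArrowColumnVector, dt: DataType, rowId: Int): Any = dt match {
  case BooleanType => reader.getBoolean(rowId)
  case ByteType    => reader.getByte(rowId)
  case IntegerType => reader.getInt(rowId)
  case LongType    => reader.getLong(rowId)
  case FloatType   => reader.getFloat(rowId)
  case DoubleType  => reader.getDouble(rowId)
  case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
}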

Member (Author):

Thanks, I'll update it.


private def createFieldWriter(vector: ValueVector): ArrowFieldWriter = {
  val field = vector.getField()
  ArrowUtils.fromArrowField(field) match {

Member:

Would it be better to do as below?

    (ArrowUtils.fromArrowField(field), vector) match {
      case (_: BooleanType, vector: NullableBitVector) => new BooleanWriter(vector)
      case (_: ByteType, vector: NullableTinyIntVector) => new ByteWriter(vector)
  ...

Member (Author):

Thanks, I'll modify it.

val a_arr = Seq(Seq(1, 2), Seq(3, 4), Seq(), Seq(5))
val b_arr = Seq(Some(Seq(1, 2)), None, Some(Seq()), None)
val c_arr = Seq(Seq(Some(1), Some(2)), Seq(Some(3), None), Seq(), Seq(Some(5)))
val d_arr = Seq(Seq(Seq(1, 2)), Seq(Seq(3), Seq()), Seq(), Seq(Seq(5)))

Member:

How about camelCase naming?

Member (Author):

Thanks, I'll modify it.

@SparkQA commented Jul 26, 2017

Test build #79955 has finished for PR 18655 at commit 5bbb46f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2017

Test build #79957 has finished for PR 18655 at commit 19f3973.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 26, 2017

Jenkins, retest this please.

@SparkQA commented Jul 26, 2017

Test build #79960 has finished for PR 18655 at commit 19f3973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2017

Test build #79995 has finished for PR 18655 at commit 0bac10d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2017

Test build #79996 has finished for PR 18655 at commit b85dc23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

LGTM, merging to master!

@asfgit asfgit closed this in 2ff35a0 Jul 27, 2017
@ueshin ueshin (Member, Author) commented Jul 28, 2017

@BryanCutler @wesm @cpcloud I filed a JIRA issue for decimal type support, SPARK-21552, and sent a WIP PR for it, #18754.
Let's move the discussion of decimal type support there.
