
[SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. #18655

Closed
wants to merge 17 commits

Conversation

@ueshin ueshin (Member) commented Jul 17, 2017

What changes were proposed in this pull request?

This is a refactoring of ArrowConverters and related classes.

  1. Refactor ColumnWriter as ArrowWriter.
  2. Add ArrayType and StructType support.
  3. Refactor ArrowConverters to skip intermediate ArrowRecordBatch creation.

How was this patch tested?

Added new tests and ran the existing tests.

@ueshin ueshin (Member, Author) commented Jul 17, 2017

cc @cloud-fan @BryanCutler

assert(dictionary == null);
boolean[] array = new boolean[count];
for (int i = 0; i < count; ++i) {
  array[i] = (boolData.getAccessor().get(rowId + i) == 1);
}

@kiszk kiszk (Member) commented Jul 17, 2017

Can we move boolData.getAccessor() out of the loop if it is a loop invariant? Or, can we use nulls?
Ditto for other types (e.g. getBytes()).
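
A minimal sketch of the hoist being suggested (illustrative only; the class under review is Java, and boolData, rowId, and count are assumed from the snippet above):

// Fetch the accessor once instead of on every iteration; it is loop invariant.
val accessor = boolData.getAccessor()
val array = new Array[Boolean](count)
var i = 0
while (i < count) {
  array(i) = accessor.get(rowId + i) == 1
  i += 1
}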


@Override
public boolean getBoolean(int rowId) {
  return boolData.getAccessor().get(rowId) == 1;
}

Member:

Can we use nulls? If so, it would be better to use another name instead of nulls.

@SparkQA commented Jul 17, 2017

Test build #79668 has finished for PR 18655 at commit 58cd465.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk (Member) commented Jul 17, 2017

Good feature, but can we split this PR into smaller PRs for ease of review since it looks large?
For example, since ArrowColumnVector is not used in the refactored code, this part can be moved to another PR.

@BryanCutler (Member):

Thanks for this @ueshin. I agree with @kiszk that it would be easier to review if you can split this into smaller PRs, maybe keep the additional type support separate? I'm all for refactoring this too, but could you elaborate with some details on why you are refactoring ColumnWriter and ArrowConverters? Thanks!

@cloud-fan (Contributor):

Yeah, let's put ArrowColumnVector and its tests in a new PR and merge that first.

ArrowWriter will also be used for pandas UDF, see https://issues.apache.org/jira/browse/SPARK-21190 for more details, so it makes sense to move it to a separate file.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

Thank you for your comments.
I agree that we should split this into smaller PRs. I'll push another commit to remove ArrowColumnVector from this as soon as possible.

@SparkQA commented Jul 18, 2017

Test build #79696 has finished for PR 18655 at commit 8ffedda.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

Jenkins, retest this please.

@ueshin ueshin (Member, Author) commented Jul 18, 2017

@BryanCutler I'd like to share the motivation for refactoring ArrowConverters and ColumnWriter.

For ColumnWriter, I first wanted to support complex types like ArrayType and StructType, so I refactored it based on your ColumnWriter implementation. I then renamed it and moved the package so that we can also use it for pandas UDF, as @cloud-fan mentioned. As you might have seen before, I'll also introduce ArrowColumnVector as a reader for Arrow vectors.

For ArrowConverters, I thought we could skip the intermediate ArrowRecordBatch creation in ArrowConverters.toPayloadIterator(). What do you think about that?

Thanks!
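
To illustrate the proposal, here is a minimal sketch of writing rows straight into the VectorSchemaRoot and letting ArrowFileWriter serialize from that same root, with no intermediate ArrowRecordBatch. The ArrowWriter API (create/write/finish) and the local names (allocator, arrowSchema, rows) are assumptions based on this PR, not a settled interface:

import java.io.ByteArrayOutputStream
import java.nio.channels.Channels

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.file.ArrowFileWriter

// Sketch only: rows is an Iterator[InternalRow]; allocator and arrowSchema are given.
val root = VectorSchemaRoot.create(arrowSchema, allocator)
val arrowWriter = ArrowWriter.create(root)            // wraps the root's field vectors
val out = new ByteArrayOutputStream()
val fileWriter = new ArrowFileWriter(root, null, Channels.newChannel(out))

while (rows.hasNext) {
  arrowWriter.write(rows.next())                      // append one row into the vectors
}
arrowWriter.finish()                                   // set the row count on the root
fileWriter.writeBatch()                                // serialize directly from the root
fileWriter.end()
val payloadBytes = out.toByteArray                     // no ArrowRecordBatch built in between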

@SparkQA commented Jul 18, 2017

Test build #79698 has finished for PR 18655 at commit 8ffedda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member):

> For ArrowConverters, I thought we could skip the intermediate ArrowRecordBatch creation in ArrowConverters.toPayloadIterator(). What do you think about that?

Ok, I see. By using ArrowWriter directly on the root, the ArrowFileWriter can use that same root when creating the byte array, so there is no need to create an intermediate ArrowRecordBatch. That sounds good to me!

> For ColumnWriter, I first wanted to support complex types like ArrayType and StructType, so I refactored it based on your ColumnWriter implementation. I then renamed it and moved the package

That's fine, but do they need to be in o.a.s.sql.execution.vectorized? If so, then what's the point of having a o.a.s.sql.execution.arrow package if ArrowUtils and ArrowWriter are not even there?

@BryanCutler BryanCutler (Member) left a review comment:

Thanks for the PR @ueshin! I have some concerns about using the TaskContext in the iterator to release resources. I believe there was a bug fix in Arrow 0.4.1 for the decimal type, and I had planned to look into upgrading to support that type. Also, with the added type support here, how does it affect the Python side, and are you planning on adding tests there?

As for the refactoring, if it offers a performance improvement, that is great. However, it seems a little out of order to me to refactor and move files around to support a SPIP that has not reached consensus and has not been voted on. Just my thoughts.

recordsInBatch += 1
context.addTaskCompletionListener { _ =>
  root.close()
  allocator.close()
}

Member:

It seems a little odd to me to tie an iterator to a TaskContext; why not just close resources as soon as the row iterator is consumed?

Member (Author):

I was worried about a memory leak when an exception happens during iteration. In that case, the task will fail before the row iterator is completely consumed.

Member:

Yeah, good point. What about closing resources in both ways? Have the listener close them in case something fails; otherwise, close immediately once the row iterator is fully consumed. I'm not really sure at what exact point the task completion listener callback runs; is it dependent on any IO?
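
A sketch of that "both ways" approach, with illustrative names (root, allocator, rowIter, and context are assumed from the surrounding toPayloadIterator code); the closed flag keeps the two paths from closing the same resources twice:

// Sketch only, not the actual implementation.
var closed = false
def closeAll(): Unit = {
  if (!closed) {
    root.close()
    allocator.close()
    closed = true
  }
}

// Safety net: runs when the task completes, including after a failure mid-iteration.
context.addTaskCompletionListener { _ => closeAll() }

// Eager path inside the returned iterator: release the Arrow buffers as soon as
// the underlying row iterator is exhausted.
def hasNext: Boolean = rowIter.hasNext || { closeAll(); false }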

val writer = new ArrowFileWriter(root, null, Channels.newChannel(out))

Utils.tryWithSafeFinally {
  var rowId = 0

Member:

nit: maybe rowCount instead of rowId, because it is a count of how many rows are in the batch so far, not a unique id?

Member (Author):

Thanks! I'll update it.


def setNull(): Unit
def setValue(input: SpecializedGetters, ordinal: Int): Unit
def skip(): Unit

Member:

What's the purpose of the skip() method?

Member (Author):

This is for the case where the value of the struct type is null.
I believe that even if the value of the struct type is null, the fields should still have some values for the same row.
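
For concreteness, a hypothetical StructWriter sketched against the ArrowFieldWriter interface quoted above (it also assumes a write(input, ordinal) method alongside the writeSkip() shown later in this review); when the struct value is null, each child writer still consumes one slot so later rows land at the right indices:

// Sketch only: children holds one writer per struct field.
class StructWriter(children: Array[ArrowFieldWriter]) extends ArrowFieldWriter {
  override def setNull(): Unit = {
    // The struct's own slot stays unset (i.e. null), but every child must still
    // advance by one slot for this row.
    children.foreach(_.writeSkip())
  }

  override def setValue(input: SpecializedGetters, ordinal: Int): Unit = {
    val struct = input.getStruct(ordinal, children.length)
    var i = 0
    while (i < children.length) {
      children(i).write(struct, i)   // write field i of this row's struct value
      i += 1
    }
  }

  override def skip(): Unit = children.foreach(_.writeSkip())
}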

@@ -391,6 +392,85 @@ class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll {
collectAndValidate(df, json, "floating_point-double_precision.json")
}

ignore("decimal conversion") {

Member:

Why ignore this?

Member (Author):

Oh, I'm sorry, I should have mentioned it.
It seems like JsonFileReader doesn't support DecimalType, so I ignored it for now.
But now I'm thinking that, if Arrow 0.4.0 has a bug for the decimal type as you said, should I remove decimal type support from this PR and add support in following PRs?

Member:

That might be true; I haven't looked into it yet. I can work on adding support on the Arrow side, so I'll try to check on that and see where it stands in the upcoming 0.5 release.

Member:

Arrow integration support for DecimalType isn't slated until v0.6, so it might work, but there are no guarantees that a record batch written in Java will be read back equal by Python/C++. Also, we can't test it here until the JsonFileReader supports it. I filed the Arrow JIRA here: https://issues.apache.org/jira/browse/ARROW-1238

@ueshin ueshin (Member, Author) commented Jul 19, 2017

I see, I'll move the files back to the arrow package.

@SparkQA commented Jul 19, 2017

Test build #79737 has finished for PR 18655 at commit b5988f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 19, 2017

Test build #79741 has finished for PR 18655 at commit a50a271.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 19, 2017

Jenkins, retest this please.

@SparkQA commented Jul 19, 2017

Test build #79744 has finished for PR 18655 at commit a50a271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2017

Test build #79786 has finished for PR 18655 at commit 7084b38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2017

Test build #79798 has finished for PR 18655 at commit 6fc4da0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ArrowWriterSuite extends SparkFunSuite

@wesm wesm (Member) commented Jul 25, 2017

On DecimalType, I want to point out that we haven't hardened the memory format and integration tests between Java<->C++ within Arrow. It would be great if you could help with this -- we ran into a problem in C++ where we needed an extra sign bit with 16-byte high precision decimals. So we have 3 memory representations:

  • 4 byte decimals (low precision)
  • 8 byte
  • 16 byte plus sign bitmap

What is Spark's internal memory representation? cc @cpcloud

@cpcloud commented Jul 25, 2017

Based on this code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala#L429-L547

It looks like there are two types:

  1. 8 byte compact decimal that fits in a Java long
  2. Up to length 16 Array[Byte] representation (based on the BINARY type).

I think Arrow's Decimal representation in Java is almost identical to this.

Looking at the BigInteger Java implementation (which is what BigDecimal sits on top of) the sign is carried around in the first byte of the array.
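
For reference, a small sketch of the two forms described above using Spark's Decimal type; the 18-digit cutoff is Decimal.MAX_LONG_DIGITS, and the actual serialization lives in the ColumnType.scala code linked above:

import org.apache.spark.sql.types.Decimal

// Sketch only: mirrors the two representations, not the actual Spark code.
def decimalBytes(d: Decimal, precision: Int): Array[Byte] = {
  if (precision <= Decimal.MAX_LONG_DIGITS) {
    // Compact form: the unscaled value fits in a Java long (8 bytes).
    java.nio.ByteBuffer.allocate(8).putLong(d.toUnscaledLong).array()
  } else {
    // Large form: two's-complement bytes of the unscaled BigInteger, with the sign
    // carried in the leading byte; up to 16 bytes for precision <= 38.
    d.toJavaBigDecimal.unscaledValue().toByteArray()
  }
}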

@wesm wesm (Member) commented Jul 26, 2017

There are a bunch of open JIRAs about decimals in Arrow: https://issues.apache.org/jira/issues/?filter=12334829&jql=project%20%3D%20ARROW%20AND%20status%20in%20(%22In%20Review%22%2C%20Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22decimal%22
Between these JIRAs and the mailing list, it would be good to come up with a game plan for integration tests between Java and C++ (and thus Python) so we can enable Spark to send decimals to Python.

@ueshin ueshin (Member, Author) commented Jul 26, 2017

@BryanCutler @wesm @cpcloud Thank you for reviewing this.
If the remaining issue here is only DecimalType support, I'd like to separate it from this PR and merge this first to avoid duplicating efforts around writers.
What do you think?

@cloud-fan (Contributor):

Yes, let's leave decimal support for follow-ups.

columnWriters(i).write(row)
i += 1
context.addTaskCompletionListener { _ =>
  if (!closed) {

Contributor:

do we really need this? I think it's ok to close twice?

Member (Author):

The allocator can be closed twice, but the root throws an exception after the allocator is closed.

Contributor:

Is this a bug in Arrow? cc @BryanCutler

Member:

The root just releases the buffers from the FieldVectors, so I would think it should be able to handle being closed twice. I'll check tomorrow if that seems reasonable.

Member:

I filed https://issues.apache.org/jira/browse/ARROW-1283 to fix this. For now, it looks like we need this.

@ueshin ueshin changed the title [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add DecimalType, ArrayType and StructType support. [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. Jul 26, 2017
@BryanCutler (Member):

+1 on holding off for DecimalType support


def writeSkip(): Unit = {
  skip()
  count += 1
}

Contributor:

For skipping purposes, is it enough to just do count += 1? E.g. vector.set(1, v1); vector.set(3, v3); value 2 is skipped.

Member (Author):

Basically, yes, it's enough except for StructType, but should we set the null bit to 1 for a skipped value?

val reader = new ArrowColumnVector(writer.root.getFieldVectors().get(0))
data.zipWithIndex.foreach {
  case (null, rowId) => assert(reader.isNullAt(rowId))
  case (datum, rowId) => assert(get(reader, rowId) === datum)
}

Contributor:

we can do something like

dt match {
 case BooleanType => reader.getBoolean(rowid)
 case IntegerType => ...
 ...
}

Then the caller side doesn't need to pass in a get function.
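
A slightly fuller sketch of that helper (only a handful of types shown), assuming the usual ColumnVector-style getters on ArrowColumnVector:

import org.apache.spark.sql.types._

// Sketch only: dispatch on the DataType so the test doesn't need a caller-supplied get.
def getValue(reader: ArrowColumnVector, dt: DataType, rowId: Int): Any = dt match {
  case BooleanType => reader.getBoolean(rowId)
  case ByteType    => reader.getByte(rowId)
  case IntegerType => reader.getInt(rowId)
  case LongType    => reader.getLong(rowId)
  case FloatType   => reader.getFloat(rowId)
  case DoubleType  => reader.getDouble(rowId)
  case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
}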

Member (Author):

Thanks, I'll update it.


private def createFieldWriter(vector: ValueVector): ArrowFieldWriter = {
  val field = vector.getField()
  ArrowUtils.fromArrowField(field) match {

Member:

Would it be better to do as below?

    (ArrowUtils.fromArrowField(field), vector) match {
      case (_: BooleanType, vector: NullableBitVector) => new BooleanWriter(vector)
      case (_: ByteType, vector: NullableTinyIntVector) => new ByteWriter(vector)
  ...

Member (Author):

Thanks, I'll modify it.

val a_arr = Seq(Seq(1, 2), Seq(3, 4), Seq(), Seq(5))
val b_arr = Seq(Some(Seq(1, 2)), None, Some(Seq()), None)
val c_arr = Seq(Seq(Some(1), Some(2)), Seq(Some(3), None), Seq(), Seq(Some(5)))
val d_arr = Seq(Seq(Seq(1, 2)), Seq(Seq(3), Seq()), Seq(), Seq(Seq(5)))

Member:

How about camelCase naming?

Member (Author):

Thanks, I'll modify it.

@SparkQA commented Jul 26, 2017

Test build #79955 has finished for PR 18655 at commit 5bbb46f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2017

Test build #79957 has finished for PR 18655 at commit 19f3973.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin ueshin (Member, Author) commented Jul 26, 2017

Jenkins, retest this please.

@SparkQA commented Jul 26, 2017

Test build #79960 has finished for PR 18655 at commit 19f3973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2017

Test build #79995 has finished for PR 18655 at commit 0bac10d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2017

Test build #79996 has finished for PR 18655 at commit b85dc23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

LGTM, merging to master!

@asfgit asfgit closed this in 2ff35a0 Jul 27, 2017
@ueshin ueshin (Member, Author) commented Jul 28, 2017

@BryanCutler @wesm @cpcloud I filed a JIRA issue for decimal type support, SPARK-21552, and sent a WIP PR for it, #18754.
Let's move the discussion of decimal type support there.
