Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23047][PYTHON][SQL] Change MapVector to NullableMapVector in ArrowColumnVector #20239

Closed
wants to merge 3 commits into from

Conversation

icexelloss
Copy link
Contributor

What changes were proposed in this pull request?

This PR changes usage of MapVector in Spark codebase to use NullableMapVector.

MapVector is an internal Arrow class that is not supposed to be used directly. We should use NullableMapVector instead.

How was this patch tested?

Existing test.

@icexelloss
Copy link
Contributor Author

cc @BryanCutler @ueshin

@icexelloss
Copy link
Contributor Author

@BryanCutler I think this comes up in the Arrow sync yesterday

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferable to use NullableMapVector to be consistent, but since MapVector is the super class, the way it currently is shouldn't cause any errors right? I believe all vectors are created through the VectorSchemaRoot and not directly, so those would be the nullable version already.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #85989 has finished for PR 20239 at commit 0e59098.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss
Copy link
Contributor Author

@BryanCutler Yes there is no error currently. This should make the code cleaner though.

@ueshin
Copy link
Member

ueshin commented Jan 12, 2018

I'm not sure we can change to NullableMapVector and I'm just worrying whether the MapVector is never happened here.
LGTM if you are sure that the MapVector is never happened here.

@ueshin
Copy link
Member

ueshin commented Jan 12, 2018

Btw, I don't mean to block this pr but why does only MapVector have Nullable version, just out of curiosity.

@icexelloss
Copy link
Contributor Author

icexelloss commented Jan 12, 2018

@ueshin and @BryanCutler I took another look and the class StructAccessor defined in ArrowColumnVector never gets used for getStruct. ArrowColumnVector.getStruct() method just calls ColumnVector.getStruct() which does the right thing. StructAccessor is used for isNullAt and does the right thing.

The branch here: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L250 does happen. As @BryanCutler mentioned, this is because MapVector is a parent of NullableMapVector and NullableMapVector is the actual class gets passed in.

@ueshin with regard to naming, in Arrow 0.8 most "Nullable" prefix to vector classes are removed with the exception of MapVector, which we plan to clean up in later releases.

@icexelloss
Copy link
Contributor Author

MapVector is still used in Arrow internal code but it should not be returned to user directly. https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java#L134

@BryanCutler Do you agree?

I also added a test "non nullable struct" in ArrowColumnVectorSuite

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86043 has finished for PR 20239 at commit e068966.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86048 has finished for PR 20239 at commit ab2a309.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Jan 17, 2018

@BryanCutler Any comments on this? Thanks!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checked. Another LGTM from me.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is better to use the NullableMapVector to be consistent. I'm not sure the test really adds anything but doesn't hurt I suppose, LGTM

asfgit pushed a commit that referenced this pull request Jan 17, 2018
…rrowColumnVector

## What changes were proposed in this pull request?
This PR changes usage of `MapVector` in Spark codebase to use `NullableMapVector`.

`MapVector` is an internal Arrow class that is not supposed to be used directly. We should use `NullableMapVector` instead.

## How was this patch tested?

Existing test.

Author: Li Jin <[email protected]>

Closes #20239 from icexelloss/arrow-map-vector.

(cherry picked from commit 4e6f8fb)
Signed-off-by: hyukjinkwon <[email protected]>
@HyukjinKwon
Copy link
Member

Merged to master and branch-2.3.

@asfgit asfgit closed this in 4e6f8fb Jan 17, 2018
@icexelloss
Copy link
Contributor Author

Thanks for everyone for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants