Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5583: [Java] When the isSet of a NullableValueHolder is 0, the buffer field should not be used #4543

Closed
wants to merge 3 commits into from

Conversation

liyafan82
Copy link
Contributor

@liyafan82 liyafan82 commented Jun 13, 2019

For each variable-width vector, like the VarCharVector, it has a set method that uses a NullableValueHolder as the input parameter. When the isSet field is set to 0, it means the value to set is null, so the buffer field of the NullableValueHolder is invalid, and should not be used.

For example, the user may set a null value in the VarCharVector with the following code snippet:

NullableVarCharHolder holder = new NullableVarCharHolder();
holder.isSet = 0;
...
varCharVector.set(i, holder);

Please note that in the code above, the holder.buffer is not set, so it is null. According to the VarCharVector#set method, it will set the bytes using holder.buffer even if holder.isSet equals 0. This will lead to an exception.

@codecov-io
Copy link

codecov-io commented Jun 13, 2019

Codecov Report

Merging #4543 into master will increase coverage by 0.92%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4543      +/-   ##
==========================================
+ Coverage   88.55%   89.47%   +0.92%     
==========================================
  Files         796      651     -145     
  Lines      103239    92152   -11087     
  Branches     1253        0    -1253     
==========================================
- Hits        91425    82456    -8969     
+ Misses      11569     9696    -1873     
+ Partials      245        0     -245
Impacted Files Coverage Δ
cpp/src/arrow/vendored/datetime/date.h 26.35% <0%> (-4.28%) ⬇️
cpp/src/arrow/type.h 91.57% <0%> (-3.38%) ⬇️
python/pyarrow/lib.pyx 97.82% <0%> (-2.18%) ⬇️
cpp/src/arrow/ipc/writer.cc 93.04% <0%> (-1.15%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 97.66% <0%> (-0.94%) ⬇️
python/pyarrow/tests/test_flight.py 79.26% <0%> (-0.82%) ⬇️
cpp/src/arrow/python/helpers.cc 80.23% <0%> (-0.63%) ⬇️
cpp/src/arrow/array.cc 90.43% <0%> (-0.21%) ⬇️
python/pyarrow/types.pxi 69.24% <0%> (-0.13%) ⬇️
cpp/src/arrow/ipc/metadata-internal.cc 87.46% <0%> (-0.12%) ⬇️
... and 211 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25b4a46...d792e0a. Read the comment docs.

@fsaintjacques fsaintjacques changed the title [ARROW-5583][Java] When the isSet of a NullableValueHolder is 0, the buffer field should not be used ARROW-5583: [Java] When the isSet of a NullableValueHolder is 0, the buffer field should not be used Jun 14, 2019
@@ -242,10 +242,12 @@ public void set(int index, NullableVarBinaryHolder holder) {
assert index >= 0;
fillHoles(index);
BitVectorHelper.setValidityBit(validityBuffer, index, holder.isSet);
final int dataLength = holder.end - holder.start;
final int dataLength = holder.isSet == 0 ? 0 : holder.end - holder.start;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the arrow spec, when a variable width slot is null the length should be zero. What symptom were you seeing that this is trying to fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this is on the write path.

I think it is clearer if you handle this the other way:

if (holder.isSet) {
/// set the bytes
}
lastSet = index;

Copy link
Contributor Author

@liyafan82 liyafan82 Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion.
I have revised the code. Please see if it looks better.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is fixing the right thing, can you explain this some more?

@@ -242,10 +242,12 @@ public void set(int index, NullableVarBinaryHolder holder) {
assert index >= 0;
fillHoles(index);
BitVectorHelper.setValidityBit(validityBuffer, index, holder.isSet);
final int dataLength = holder.end - holder.start;
final int dataLength = holder.isSet == 0 ? 0 : holder.end - holder.start;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this is on the write path.

I think it is clearer if you handle this the other way:

if (holder.isSet) {
/// set the bytes
}
lastSet = index;

@liyafan82
Copy link
Contributor Author

liyafan82 commented Jun 17, 2019

I'm not sure this is fixing the right thing, can you explain this some more?

Sorry, I should have made this clearer.

In our scenario, we often set a null value in the VarCharVector with the following code snippet:

NullableVarCharHolder holder = new NullableVarCharHolder();
holder.isSet = 0;
...
varCharVector.set(i, holder);

Please note that in the code above, the holder.buffer is not set, so it is null. According to the VarCharVector#set method, it will set the bytes using holder.buffer even if holder.isSet equals 0. This will lead to an exception.

@emkornfield
Copy link
Contributor

This change looks OK to me but I'm not sure if there might be performance concerns with introducing a branch here? @pravindra?

@liyafan82
Copy link
Contributor Author

This change looks OK to me but I'm not sure if there might be performance concerns with introducing a branch here? @pravindra?

@emkornfield , thanks for the consideration.

I think anyway, there will be performance penalty here. To remove the branch, the user needs to set NullableVarCharHolder#buffer to a valid ArrowBuf, even if the value is null. I do not think this makes much sense to the user.

If the user is sure that the value is not null, to avoid the performance penalty, he/she should use the set method with VarCharHolder as the parameter. There is no branch in that set method.

@pravindra, would you please give some comments?

@pravindra
Copy link
Contributor

The change lgtm.

I see the additional branch but I think it's useful if the holder.buffer in the null or if it has a non-zero length. @liyafan82 - isn't the same change required in setSafe() too ?

@liyafan82
Copy link
Contributor Author

The change lgtm.

I see the additional branch but I think it's useful if the holder.buffer in the null or if it has a non-zero length. @liyafan82 - isn't the same change required in setSafe() too ?

@pravindra thanks a lot for your kind reminder.
I have revised the code for setSafe accordingly.

@emkornfield
Copy link
Contributor

+1, LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants