-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6355: [Java] Make range equal visitor reusable #5195
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, mostly looks good to me, some minor comments.
@@ -121,9 +126,9 @@ protected boolean compareStructVectors(NonNullableStructVector left) { | |||
return false; | |||
} | |||
|
|||
ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(floatEpsilon, doubleEpsilon); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also do this for compareListVectors/compareFixedSizeLiseVectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems compareListVector/compareFixedSizeListVectors in this class does not change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Fixed now.
for (String name : left.getChildFieldNames()) { | ||
RangeEqualsVisitor visitor = new RangeEqualsVisitor(rightVector.getChild(name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for compareListVectors/compareFixedSizeListVectors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks for the good point.
for (int k = 0; k < leftChildren.size(); k++) { | ||
ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(rightChildren.get(k), | ||
floatEpsilon, doubleEpsilon); | ||
visitor.set(rightChildren.get(k), 0, 0, rightChildren.get(k).getValueCount(), true); | ||
if (!leftChildren.get(k).accept(visitor, null)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we give a message if visitor is directly used without setting params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code to make leftStart, rightStart and length have -1 as default value. Since each visit method calls validate method first, if the set method is not called, an exception will be thrown.
9e04bc2
to
c38453d
Compare
c38453d
to
a1f7046
Compare
Codecov Report
@@ Coverage Diff @@
## master #5195 +/- ##
===========================================
+ Coverage 72.25% 89.7% +17.44%
===========================================
Files 837 693 -144
Lines 112023 104518 -7505
Branches 1437 0 -1437
===========================================
+ Hits 80946 93761 +12815
+ Misses 30715 10757 -19958
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
@pravindra I think you've been reviewing the Visitor changes, mind taking a look at this? |
for (String name : left.getChildFieldNames()) { | ||
ApproxEqualsVisitor visitor = new ApproxEqualsVisitor(rightVector.getChild(name), | ||
floatEpsilon, doubleEpsilon); | ||
visitor.set(rightVector.getChild(name), 0, 0, rightVector.getChild(name).getValueCount(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two other ways to pass along parameters
a. for parameters that do not change between invocation, can put them in the constructor of the vistor
b. for parameters that are different for each invocation, pass along as a pojo/object in the accept() call (instead of null).
did you consider these options ? The drawback of the current approach is that it's mandating a strict order between (visitor.set and the accept call) i.e the accept call must always be preceeded by the visitor.set call (not doing so will be caught at runtime only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravindra Sounds good.
Wrapping changing parameters into a pojo is a good idea. Too many parameters may confuse users and easily produce bugs.
I have revised the code in response to @pravindra 's comments. Brief change list:
Despite above changes, I still think there are some problems with the current code. For example,
Please give some comments. |
@liyafan82 I'll review this over the weekend. |
@pravindra thank you so much for your effort. |
@liyafan82 the POJO is efficient to use but it's making it harder to follow code. From the use case you mentioned, we want to use the same visitor instance to compare different ranges. Given that, I think if the POJO has only the range (leftStart, rightStart, index), and the rest should be in the constructor. If we separate out the Range from the actual visitor, it makes the type/range validation also efficient.
@liyafan82 I tried this out on top of your change on top of your PR. Please see if this makes sense. |
@pravindra Thanks a lot for your effort. Your code looks good to me in general. By moving the vectors out of the range parameter, the code readability improves a lot. However, there are some small problems: the visitors for comparing UnionVector and StructVector cannot be reused. By weighing the benefits and costs, I generally prefer your your changes. @emkornfield What do you think? |
maybe, alloc these on first call and reuse ?
|
Sounds good for short vectors. To have one inner visitor and make it reusable, I think we should have an API for the visitor to override the left and right vectors. However, that would make the visitor mutable, which is not a good design. What do you think? |
What' s the overhead ? heap-memory or allocation calls ? I don't believe heap-memory is much of an overhead in this case. The number of allocation calls with this approach remains same with the single-use case (VectorEqualsVisitor) but reduces drastically in the multi reuse case (eg. dedup/dictionary).
agree with the design part. I'm not convinced of the benefit of efficiency either with making the visitor mutable - the mutability requires updating every field in the POJO for each child call, and doing type-checks repeatedly. That can be significantly higher than the allocation cost. |
@pravindra I agree with you that the overhead for heap-memory and allocation calls are insignificant. I also agree with you about the high cost of type-checks. Your method works well for StructVector. However, it may not work well for UnionVector. This is because leftChildren.get(k) and rightChildren.get(k) may get different vectors for different values of k. So the visitor cannot be reused. That being said, I think your method is already much better than the original implementation. Maybe we can solve the problems for UnionVector or StructVector in a later issue. What do you think? |
lgtm. thanks ! |
Move out Range from the other visitor params
The current implementation seems a little confused to me:
ii. Since we now need to pass both left vector and right vector, it's more like a comparator than a visitor, if in this way, I think all visitor API should be removed? |
not sure what you mean - we are still doing an accept with the visitor API. yes, It's essentially a comparator (it's used to compute VectorEquals or RangeEquals) - the fact that it internally a visitor pattern (rather than say, instanceof checks) doesn't need to be exposed to the users of the API. |
For example, in BaseFixedWidthVector we pass 'this' as leftVector:
However in RangeEqualsVisitor, this param is not used
|
Hi @tianchen92 , thanks for the good point. |
@liyafan82 Thanks for your feedback and I think i totally understand what you mean.
If we use accept API like below, it seems something duplicated for leftVector
More important, it could not prevent users using like this way which is not correct:
In this aspect, with the current implementation, the 'accept' api seems useless and not sure if we should remove it. |
The end-result would be that it would still give the results of RangeEquals on vector1/vector2. but, it's harmless. The new semantics of RangeEqualsVisitor allows it to more efficient by reducing the number of type checks.
The accept/visit APIs provide the standard benefits of visitor
|
I see, thanks for clarify. |
@tianchen92 sorry, I thought some more on this and I think the case you pointed out is not actually harmless.
if vector1/vector2 are say, StructVectors and vector3 is an IntVector - things can go bad. we'll use the compareBaseFixedWidthVectors() and do wrong type-casts for vector1/vector2. can you please open a jira for this ? |
Sure, tracked in https://issues.apache.org/jira/browse/ARROW-6472 |
@tianchen92 Thank you for openning a JIRA for tracking this. IMO, the visitor is a good idea, which provides much flexibility. |
As discussed in #5195 (comment), there are some problems with the current ways of comparing floating point vectors, we solve them in this PR: 1. there are if statements/duplicated members in ApproxEqualsVisitor, making the code redundant and less clear. 2. the comparion of float4 and float8 are based on wrapped objects Float and Double, which may have performance penalty. Closes #5304 from liyafan82/fly_0905_float and squashes the following commits: 907c17d <liyafan82> Remove value boxing/unboxing for ApproxEqualsVisitor Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Related to [ARROW-6472](https://issues.apache.org/jira/browse/ARROW-6472). If we use visitor API this way: >RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); vector3.accept(visitor, range) if vector1/vector2 are say, StructVector}}s and vector3 is an {{IntVector - things can go bad. we'll use the compareBaseFixedWidthVectors() and do wrong type-casts for vector1/vector2. Discussions see: #5195 (comment) https://issues.apache.org/jira/browse/ARROW-6472 Closes #5483 from tianchen92/ARROW-6472 and squashes the following commits: 3d3d295 <tianchen> add test 12e4aa2 <tianchen> ARROW-6472: ValueVector#accept may has potential cast exception Authored-by: tianchen <[email protected]> Signed-off-by: Pindikura Ravindra <[email protected]>
Related to [ARROW-6472](https://issues.apache.org/jira/browse/ARROW-6472). If we use visitor API this way: >RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); vector3.accept(visitor, range) if vector1/vector2 are say, StructVector}}s and vector3 is an {{IntVector - things can go bad. we'll use the compareBaseFixedWidthVectors() and do wrong type-casts for vector1/vector2. Discussions see: apache#5195 (comment) https://issues.apache.org/jira/browse/ARROW-6472 Closes apache#5483 from tianchen92/ARROW-6472 and squashes the following commits: 3d3d295 <tianchen> add test 12e4aa2 <tianchen> ARROW-6472: ValueVector#accept may has potential cast exception Authored-by: tianchen <[email protected]> Signed-off-by: Pindikura Ravindra <[email protected]>
According to the discussion in apache#4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc). According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead. To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor. Closes apache#5195 from liyafan82/fly_0826_reuse and squashes the following commits: ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195 073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params 7482414 <liyafan82> Wrapper visit parameters into a pojo 53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse a1f7046 <liyafan82> Make range equal visitor reusable Lead-authored-by: liyafan82 <[email protected]> Co-authored-by: Pindikura Ravindra <[email protected]> Co-authored-by: liyafan82 <[email protected]> Signed-off-by: Pindikura Ravindra <[email protected]>
As discussed in apache/arrow#5195 (comment), there are some problems with the current ways of comparing floating point vectors, we solve them in this PR: 1. there are if statements/duplicated members in ApproxEqualsVisitor, making the code redundant and less clear. 2. the comparion of float4 and float8 are based on wrapped objects Float and Double, which may have performance penalty. Closes #5304 from liyafan82/fly_0905_float and squashes the following commits: 907c17d08 <liyafan82> Remove value boxing/unboxing for ApproxEqualsVisitor Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Related to [ARROW-6472](https://issues.apache.org/jira/browse/ARROW-6472). If we use visitor API this way: >RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); vector3.accept(visitor, range) if vector1/vector2 are say, StructVector}}s and vector3 is an {{IntVector - things can go bad. we'll use the compareBaseFixedWidthVectors() and do wrong type-casts for vector1/vector2. Discussions see: apache/arrow#5195 (comment) https://issues.apache.org/jira/browse/ARROW-6472 Closes #5483 from tianchen92/ARROW-6472 and squashes the following commits: 3d3d29530 <tianchen> add test 12e4aa200 <tianchen> ARROW-6472: ValueVector#accept may has potential cast exception Authored-by: tianchen <[email protected]> Signed-off-by: Pindikura Ravindra <[email protected]>
According to the discussion in #4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc).
According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead.
To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor.