Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for UUID comparison functions #10791

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

BryanCutler
Copy link

@BryanCutler BryanCutler commented Aug 20, 2024

This adds binary comparison functions <, >, <=, >= to the UUID custom data type. Equality functions are already present. Added unit tests for testing a query with comparisons between UUID values.

Also fixes the binary representation of UuidType to be a LE int128_t and adjusted Presto serde to put UUID values in the format that Presto Java expects, as 2 long values with msb first.

From #10584, prestodb/presto#23311

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 20, 2024
Copy link

netlify bot commented Aug 20, 2024

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 5aa91d4
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67042c058c13de00080b4110
😎 Deploy Preview https://deploy-preview-10791--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler. One small comment.

Did you try running the queries in prestodb/presto#23301 with this logic ?

}
};

#define VELOX_GEN_BINARY_EXPR_UUID(Name, uuidCompExpr, TResult) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TResult parameter is not used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch! It's actually the same in VELOX_GEN_BINARY_EXPR_TIMESTAMP_WITH_TIME_ZONE too

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@BryanCutler
Copy link
Author

Did you try running the queries in prestodb/presto#23301 with this logic ?

Yes, and the queries for <,>,<=,>= now pass. But it is a little strange, they only pass for me in debug mode, when I run in release mode I get this type of error:

not equal
Actual rows (1 of 1 extra rows shown, 6 rows in total):
    [f768f36d-4f09-4da7-a298-3564d8f3c986, 1fdff429-6f65-4370-bb2a-7ab660dbec92]
Expected rows (1 of 1 missing rows shown, 6 rows in total):
    [1fdff429-6f65-4370-bb2a-7ab660dbec92, f768f36d-4f09-4da7-a298-3564d8f3c986]

where it looks like one UUID is randomly generated, and I'm not sure where it's coming from. I'll look into that a bit more.

@BryanCutler
Copy link
Author

@aditi-pandit I didn't realize the temp table from the test in prestodb/presto#23301 included a randomly generated UUID with (cast(uuid() AS VARCHAR)). That was the reason for failing in release mode, without this random value all the comparison tests pass.

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler. Had some minor comments.

velox/functions/prestosql/UuidFunctions.h Outdated Show resolved Hide resolved
template <typename T> \
struct Name##Uuid : public UuidCompareBase { \
VELOX_DEFINE_FUNCTION_TYPES(T); \
FOLLY_ALWAYS_INLINE void call( \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Add newline in between.

Copy link
Author

@BryanCutler BryanCutler Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditi-pandit I don't see a consistent format for this, most of the time it looks how it is currently (when the arguments don't fit within a single line) but I do see some cases where the format is:

template <typename T>
struct MultiplyVoidOutputFunction {
  template <typename TInput>
  FOLLY_ALWAYS_INLINE void
  call(TInput& result, const TInput& a, const TInput& b) {
    result = functions::multiply(a, b);
  }
};

Do you mean the latter, with a newline after void?

Copy link
Collaborator

@aditi-pandit aditi-pandit Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler : I meant newline in between

VELOX_DEFINE_FUNCTION_TYPES(T);      
                                    
FOLLY_ALWAYS_INLINE void call(      

You can stick with the formatting from make format-fix otherwise.

@@ -108,5 +108,45 @@ TEST_F(UuidFunctionsTest, unsupportedCast) {
evaluate("cast(123 as uuid())", input), "Cannot cast BIGINT to UUID.");
}

TEST_F(UuidFunctionsTest, comparisons) {
auto data0 = makeFlatVector<std::string>({
"33355449-2c7d-43d7-967a-f53cd23215ad",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add more tests where the lower 8 bytes are the same for better coverage of the comparison logic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, for the lower bits I have a check for <, = and I'll add one more for >

Copy link
Author

@BryanCutler BryanCutler Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some additional cases, one with equal lower bytes

@aditi-pandit aditi-pandit changed the title Add Support for UUID Comparison Functions Add support for UUID Comparison Functions Aug 26, 2024
@aditi-pandit aditi-pandit changed the title Add support for UUID Comparison Functions Add support for UUID Comparison functions Aug 26, 2024
Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler.

@aditi-pandit
Copy link
Collaborator

@Yuhta @kagamiori @mbasmanova : Please can you help with the review.

@BryanCutler
Copy link
Author

Gentle ping @Yuhta @kagamiori @mbasmanova to please review

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler Thank you for adding this functionality. I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation (https://facebookincubator.github.io/velox/develop/types.html) to clarify.

#include "velox/functions/prestosql/types/UuidType.h"

namespace facebook::velox::functions {

struct UuidCompareBase {
FOLLY_ALWAYS_INLINE int compare64(uint64_t x, uint64_t y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be static functions or free functions.

Let's also put this code into 'details' namespace.

}

int compare128(const int128_t& lhs, const int128_t& rhs) {
int comp = compare64(HugeInt::upper(lhs), HugeInt::upper(rhs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comp

Do not abbreviate: comp -> result or similar.

"33355449-2c7d-43d7-967a-f53cd23215ad",
});

auto actual = evaluate("cast(c0 as uuid) < cast(c1 as uuid)", makeRowVector({data0, data1}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason not to test one value at a time using evaluateOnce? This makes tests easier to read and debug.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just not familiar with the test environment. I will try with evaluateOnce.

@mbasmanova mbasmanova changed the title Add support for UUID Comparison functions Add support for UUID comparison functions Sep 9, 2024
@BryanCutler
Copy link
Author

Thanks for reviewing @mbasmanova

I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation

I basically followed the comparison done in in Presto Java https://github.com/prestodb/presto/blob/97d88cfdebb553b5c7fb0e217b0211573df3d190/presto-common/src/main/java/com/facebook/presto/common/type/UuidType.java#L97

Since int128_t doesn't have support for atomic operations, we first check the higher bytes, then lower. Returning an int from the comparison allows it to be streamlined to do minimal checks. I will update the PR and documentation.

@mbasmanova
Copy link
Contributor

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@BryanCutler
Copy link
Author

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

@aditi-pandit
Copy link
Collaborator

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

@mbasmanova : I found UUID comparisons in the Java spec also https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html

My understanding is that a source generating UUIDs uses a combination of timestamp with the DNS address of the generating device. So there is some notion of ordering of UUIDs as a result.

@BryanCutler
Copy link
Author

@mbasmanova just wanted to confirm if you are ok with adding this functionality? I still need to update the tests per your request and will get to that soon, thanks!

@mbasmanova
Copy link
Contributor

@BryanCutler Yes, please, move ahead with addressing the comments. Thank you for clarifying the use case.

@BryanCutler
Copy link
Author

@mbasmanova @aditi-pandit after looking into prestodb/presto#23311 there are a couple issues with UUIDs in Velox:

  1. UUID data is being stored as big endian in the int128_t value, which comes from using boost:uuid during casting to/from strings. This causes comparisons like 11000000-0000-0022-0000-000000000000 < 22000000-0000-0011-0000-000000000000 to fail.
  2. Serialization of the int128_t value to an int128 Array Block results in [LSW, MSW], however the Presto Java UUIDType expects long values from the int128 Array Block to be [MSW, LSW].

For (1), since Velox stores UUID values with an int128_t, unlike boost::uuid which is a byte array, I think the value should be in little endian. This means instead of directly using memcpy bytes from the boost::uuid during casting, the bytes would need to swapped to get int128 in LE order. wdyt?

To fix (2), the word order would need to be switched during serialization. Would you prefer I fix that in this PR or open another?

@aditi-pandit
Copy link
Collaborator

aditi-pandit commented Sep 25, 2024

I feel the first fix should be to address is the issue of the incorrect results in prestodb/presto#23311. Post that we can do this comparisons PR.

From your comments seems that requires 2 changes:
i) Fixing the seriailization order between the native and java sides (2) above
ii) Changing the uuid int128_t to be little endian in the copy from boost::uuid.

@mbasmanova : Using boost::uuid (instead of int128_t) to represent UUID in Velox could be cleaner given these issues. We could make UUID TypeKind::VARBINARY and use the boost::uuid bytes to represent it. wdyt ?

@BryanCutler : If we did that, would we need the serialization changes from Native to Java side as well ? My intuition is yes, but I could be wrong.

@Yuhta : Jimmy, would be great to hear your thoughts as well.

@BryanCutler
Copy link
Author

Thanks @aditi-pandit , if we did change the storage from int128_t to a byte array in boost::uuid, then we would still need to adjust the serialization to Java, since Presto expects an Int128ArrayBlock.

@Yuhta
Copy link
Contributor

Yuhta commented Oct 1, 2024

@aditi-pandit I don't think VARBINARY is the right physical type for UUID, it is meant for variable length data, while UUID is fixed length. I would say just keep using int128 and fix the endianness so the comparison between UUID is the same as comparison between int128.

@aditi-pandit
Copy link
Collaborator

Thanks @Yuhta for your response. I do see a pattern of using VARBINARY for fixed lengths (in IPPREFIX type) but that has length 17 bytes which can't be captured with any integral type. For UUID we do have the option of using int128_t.

@BryanCutler : Let's go with Jimmy's suggestion. Will make our implementation simpler as well.

@BryanCutler
Copy link
Author

Thanks @Yuhta and @aditi-pandit , I have fixed the serialization issue here and added a unit test, I can put that in a separate PR if preferred. I will also test these changes again with prestodb/presto#23311

@BryanCutler
Copy link
Author

BryanCutler commented Oct 8, 2024

After testing the comparisons with Presto Java in prestodb/presto#23311 there are some different results because Presto Java uses Slice.compareTo() which compares values by swapping bytes of each long value, see https://github.com/airlift/slice/blob/0.38/src/main/java/io/airlift/slice/Slice.java#L1336. So the values are ordered according to the LSB of each long value, which seems a little strange and I'm not sure if that should be changed.

** moved the serializer fix to #11197, will update this pr soon
** updated - the underlying issue in Presto Java is an additional byte swap, not an overflow from https://bugs.openjdk.org/browse/JDK-7025832

return x < y ? -1 : (x == y ? 0 : 1);
}

static int compare128(const int128_t& lhs, const int128_t& rhs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just compare (uint128_t)lhs and (uint128_t)rhs

@BryanCutler
Copy link
Author

Just compare (uint128_t)lhs and (uint128_t)rhs

Thanks @Yuhta , I can look into that but there is a bigger concern that this type of ordering will not match Presto Java, see above #10791 (comment).

Presto Java UuidOperations use Slice.compareTo() which follows a strange ordering based on swapping bytes. As far as I can tell this ordering was changed in the Slice library 2.0+, which uses Arrays.compareUnsigned and I don't believe it would have the same problem, but not sure upgrading that is planned in the near-term.

I'm not sure if we should try to follow Presto Java with the comparisons or use the spec at RFC-4122 to order lexicographically, e.g. comparing uint128_t in little endian, and differ from Presto Java. Please share recommendations on this @Yuhta @aditi-pandit

@BryanCutler
Copy link
Author

There is a recent related PR cherry-picked from trindodb, this might address some part of the Presto Java issue prestodb/presto#17939.

@Yuhta
Copy link
Contributor

Yuhta commented Oct 16, 2024

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

@BryanCutler
Copy link
Author

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

There is a PR up at prestodb/presto#23847 that corrects Presto Java to conform to the IETF RFC 4122 definition of ordering UUIDs. When that is resolved, this PR can be updated to match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants