Fix UUID comparisons to conform to IETF RFC 4122 #23847

ZacBlanco · 2024-10-16T19:18:32Z

Description

Reverse UUIDs bytes internally to make comparison operators conform to IETF RFC 4122

Motivation and Context

The presto documentation states that we support UUIDs and conform to RFC 4122¹:

Before this change, UUIDs were read in as two longs in big endian format and used the that byte order for comparisons. However, the java bytewise comparison in our io.airlift.slice dependency assumes the backing values are in little endian format, so the bytes are swapped during the comparison². This made comparisons operators between UUIDs incorrect according the RFC 4122³ §3 under "Rules for Lexical Equivalence" on p.5.

UUIDs, as defined in this document, can also be ordered lexicographically.
For a pair of UUIDs, the first one follows the second if the most significant
field in which the UUIDs differ is greater for the first UUID. The second
follows the first if the most significant field in which the UUIDs differ
is greater for the second UUID.

Note that RFC 4122³ has an errata in the paragraph describing the lexicographic comparison in which the original text is inconsistent. The corrected text can be found in the errata EID 1428⁴ and is reproduced above for easier reference.

Additionally, RFC 9562 has been published which, supersedes 4122. It defines the sorting rules as a simple byte-wise comparison in §6.11⁵

UUIDv6 and UUIDv7 are designed so that implementations that require sorting (e.g., database indexes) sort as opaque raw bytes without the need for parsing or introspection.

For example, before this change, the following comparison of UUIDs would result in a TRUE result:

presto:tpch> SELECT UUID '00000000-0000-0000-1000-000000000000' < UUID '00000000-0000-0000-0000-000000000001';
 _col0
-------
 true
(1 row)

This seems to be incorrect because the reading of the 00000000-0000-0000-1000-000000000000 UUID (say, UUID "A") appears to have a 1 byte in a more significant position than the 1 byte in the UUID 00000000-0000-0000-0000-000000000001 (say, UUID "B"). Because A has a 1 byte in a more significant position than B, this comparison should evaluate to FALSE.

In addition, when testing this same comparison in postgres (for which we also support the UUID type), postgres returns results which are inconsistent with Presto.

postgres=# SELECT uuid '00000000-0000-0000-1000-000000000000' < uuid '00000000-0000-0000-0000-000000000001';
 ?column?
----------
 f
(1 row)

Additional verification on ordering from postgres

SELECT * FROM (VALUES  
    (uuid '00000000-0000-0000-0000-000000000000'),
    (uuid '00000000-0000-0000-0000-000000000001'),
    (uuid '00000000-0000-0000-0000-000000000010'),
    (uuid '00000000-0000-0000-0000-000000000100'),
    (uuid '00000000-0000-0000-0000-000000001000'),
    (uuid '00000000-0000-0000-0000-000000010000'),
    (uuid '00000000-0000-0000-0000-000000100000'),
    (uuid '00000000-0000-0000-0000-000001000000'),
    (uuid '00000000-0000-0000-0000-000010000000'),
    (uuid '00000000-0000-0000-0000-000100000000'),
    (uuid '00000000-0000-0000-0000-001000000000'),
    (uuid '00000000-0000-0000-0000-010000000000'),
    (uuid '00000000-0000-0000-0000-100000000000'),
    (uuid '00000000-0000-0000-0001-000000000000'),
    (uuid '00000000-0000-0000-0010-000000000000'),
    (uuid '00000000-0000-0000-0100-000000000000'),
    (uuid '00000000-0000-0000-1000-000000000000'),
    (uuid '00000000-0000-0001-0000-000000000000'),
    (uuid '00000000-0000-0010-0000-000000000000'),
    (uuid '00000000-0000-0100-0000-000000000000'),
    (uuid '00000000-0000-1000-0000-000000000000'),
    (uuid '00000000-0001-0000-0000-000000000000'),
    (uuid '00000000-0010-0000-0000-000000000000'),
    (uuid '00000000-0100-0000-0000-000000000000'),
    (uuid '00000000-1000-0000-0000-000000000000'),
    (uuid '00000001-0000-0000-0000-000000000000'),
    (uuid '00000010-0000-0000-0000-000000000000'),
    (uuid '00000100-0000-0000-0000-000000000000'),
    (uuid '00001000-0000-0000-0000-000000000000'),
    (uuid '00010000-0000-0000-0000-000000000000'),
    (uuid '00100000-0000-0000-0000-000000000000'),
    (uuid '01000000-0000-0000-0000-000000000000'),
    (uuid '10000000-0000-0000-0000-000000000000')
) as t(u) order by u desc;

result

                 u
--------------------------------------
 10000000-0000-0000-0000-000000000000
 01000000-0000-0000-0000-000000000000
 00100000-0000-0000-0000-000000000000
 00010000-0000-0000-0000-000000000000
 00001000-0000-0000-0000-000000000000
 00000100-0000-0000-0000-000000000000
 00000010-0000-0000-0000-000000000000
 00000001-0000-0000-0000-000000000000
 00000000-1000-0000-0000-000000000000
 00000000-0100-0000-0000-000000000000
 00000000-0010-0000-0000-000000000000
 00000000-0001-0000-0000-000000000000
 00000000-0000-1000-0000-000000000000
 00000000-0000-0100-0000-000000000000
 00000000-0000-0010-0000-000000000000
 00000000-0000-0001-0000-000000000000
 00000000-0000-0000-1000-000000000000
 00000000-0000-0000-0100-000000000000
 00000000-0000-0000-0010-000000000000
 00000000-0000-0000-0001-000000000000
 00000000-0000-0000-0000-100000000000
 00000000-0000-0000-0000-010000000000
 00000000-0000-0000-0000-001000000000
 00000000-0000-0000-0000-000100000000
 00000000-0000-0000-0000-000010000000
 00000000-0000-0000-0000-000001000000
 00000000-0000-0000-0000-000000100000
 00000000-0000-0000-0000-000000010000
 00000000-0000-0000-0000-000000001000
 00000000-0000-0000-0000-000000000100
 00000000-0000-0000-0000-000000000010
 00000000-0000-0000-0000-000000000001
 00000000-0000-0000-0000-000000000000

Impact

Fixes comparisons to conform to the UUID RFC. Technically breaks compatibility between versions as comparisons now behave differently

Test Plan

Additional test cases for comparisons

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Improve UUID comparisons so they conform to `IETF RFC 4122 <https://datatracker.ietf.org/doc/html/rfc4122>`_. :pr:`23847`

steveburnett · 2024-10-16T19:59:13Z

Thanks for the release note entry! A couple of nit suggestions to follow the phrasing in the Order of changes in the Release Notes Guidelines, add the PR number, and consider adding a link to the RFC. I found this link I used in the suggestion, if you have a better link please use it instead.

== RELEASE NOTES ==

General Changes
* Improve UUID comparisons so they are now performed correctly according to `IETF RFC 4122 <https://datatracker.ietf.org/doc/html/rfc4122>`_. :pr:`23847`

ZacBlanco · 2024-10-17T04:40:19Z

Thanks as usual for your suggestions @steveburnett 😄

tdcmeehan self-assigned this Oct 16, 2024

ZacBlanco force-pushed the upstream-uuid-big-endian branch from 347c794 to e66c3dc Compare October 16, 2024 19:26

ZacBlanco force-pushed the upstream-uuid-big-endian branch from e66c3dc to 1c0504e Compare October 16, 2024 21:25

use big-endian representation for uuids

ffbccc5

ZacBlanco force-pushed the upstream-uuid-big-endian branch from 1c0504e to ffbccc5 Compare October 16, 2024 21:59

ZacBlanco marked this pull request as ready for review October 17, 2024 05:27

ZacBlanco requested review from elharo and a team as code owners October 17, 2024 05:28

ZacBlanco requested a review from presto-oss October 17, 2024 05:28

ZacBlanco changed the title ~~Use big-endian representation for UUIDs~~ Fix UUID comparisons to conform to IETF RFC 4122 Oct 17, 2024

ZacBlanco mentioned this pull request Oct 17, 2024

Result mismatches between Presto java and native for UUID types #23311

Open

BryanCutler mentioned this pull request Oct 21, 2024

Add support for UUID comparison functions facebookincubator/velox#10791

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix UUID comparisons to conform to IETF RFC 4122 #23847

Fix UUID comparisons to conform to IETF RFC 4122 #23847

ZacBlanco commented Oct 16, 2024 •

edited

Loading

steveburnett commented Oct 16, 2024

ZacBlanco commented Oct 17, 2024

Fix UUID comparisons to conform to IETF RFC 4122 #23847

Are you sure you want to change the base?

Fix UUID comparisons to conform to IETF RFC 4122 #23847

Conversation

ZacBlanco commented Oct 16, 2024 • edited Loading

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Footnotes

steveburnett commented Oct 16, 2024

ZacBlanco commented Oct 17, 2024

ZacBlanco commented Oct 16, 2024 •

edited

Loading