Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using msgpack.encode instead tostring in hash function #207

Closed
olegrok opened this issue Dec 24, 2019 · 6 comments
Closed

Consider using msgpack.encode instead tostring in hash function #207

olegrok opened this issue Dec 24, 2019 · 6 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@olegrok
Copy link
Contributor

olegrok commented Dec 24, 2019

What's wrong with tostring:

tarantool> tostring(922337203685477580LL)
---
- 922337203685477580LL
...

tarantool> tostring(922337203685477580ULL)
---
- 922337203685477580ULL
...

Possible consequences:

tarantool> box.space.test:insert({922337203685477580LL})
---
- [922337203685477580]
...

tarantool> tostring(box.space.test:get({922337203685477580LL})[1])
---
- 922337203685477580ULL
...

Hashes calculated for "922337203685477580ULL" and "922337203685477580LL" are different. This could lead to some unexpected results for users. (I understand that tarantool internally uses msgpuck that stores any positive integer as unsigned)

Without breaking of backward compatibility I suggest to introduce new hash function that will be stable for such cases. Or at least document such behaviour.

@Gerold103 Gerold103 added the bug Something isn't working label Feb 19, 2020
@Gerold103 Gerold103 self-assigned this Feb 19, 2020
@Gerold103 Gerold103 added this to the 0.2 milestone Feb 19, 2020
@Gerold103
Copy link
Collaborator

Gerold103 commented Feb 22, 2020

I totally agree, good catch, thanks for reporting it. Unfortunately, I can't change the existing vshard.router.bucket_id() function, because it will corrupt existing users, but I can deprecate it, and introduce a new function. Then the old one will be deleted after some time. I am going to call a new one vshard.router.bucket_id_mpcrc32() - this will explicitly say, that it uses MessagePack and crc32. In case more functions will be added, I will just use different suffixes for them.
Sounds ok?

@olegrok
Copy link
Contributor Author

olegrok commented Feb 22, 2020

Yes, agree with you, let's introduce such function that uses msgpack instead of tostring

@R-omk
Copy link
Contributor

R-omk commented Feb 22, 2020

However, it is also worth considering that numbers in msgpack can be encoded differently.

@olegrok
Copy link
Contributor Author

olegrok commented Feb 22, 2020

However, it is also worth considering that numbers in msgpack can be encoded differently.

Could you please give an example?

Without explicit casts I assume it's safe

tarantool> msgpack.encode(1)
---
- "\x01"
...

tarantool> msgpack.encode(1ULL)
---
- "\x01"
...

tarantool> msgpack.encode(1LL)
---
- "\x01"
...

tarantool> msgpack.encode(ffi.cast('double', 1))
---
- !!binary yz/wAAAAAAAA
...
tarantool> tostring(ffi.cast('double', 1))
---
- 'cdata<double>: 0x0102d885c8'
...

And without losing the precision of the number, that's obvious.

tarantool> msgpack.encode(1844674407370955161LL)
---
- !!binary zxmZmZmZmZmZ
...
tarantool> msgpack.encode(1844674407370955161ULL)
---
- !!binary zxmZmZmZmZmZ
...
tarantool> msgpack.encode(1844674407370955161)
---
- !!binary zxmZmZmZmZoA
...
tarantool> 1844674407370955161
---
- 1844674407370955264
...

@Gerold103
Copy link
Collaborator

tarantool> msgpack.encode(ffi.cast('float', 1))
---
- !!binary yj+AAAA=
...

tarantool> msgpack.encode(ffi.cast('double', 1))
---
- !!binary yz/wAAAAAAAA
...

tarantool> msgpack.encode(1)
---
- "\x01"
...

My plan was to call tonumber() on cdata numbers before passing them to encode.
Now I already think, that I can fix the existing bucket_id() in the same way - just call tonumber(), if see any cdata number. It won't make worse, because currently bucket_id() is just broken for such values. Assuming, that it was always used for normal numbers only, tonumber() won't break existing users, but will make it safe.

Gerold103 added a commit that referenced this issue Feb 25, 2020
Closes #207

@TarantoolBot document
Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
vshard.router.bucket_id() is deprecated, each its usage logs a
warning. It still works, but will be deleted in future.

Behaviour of the old bucket_id() function is now available as
vshard.router.bucket_id_strcrc32(). It works exactly like the old
function, but does not log a warning.

The reason why there is a new function bucket_id_mpcrc32() is that
the old bucket_id() and the new bucket_id_strcrc32() are not
consistent for cdata numbers. In particular, they return 3
different values for normal Lua numbers like 123, for unsigned
long long cdata (like 123ULL, or ffi.cast('unsigned long long',
123)), and for signed long long cdata (like 123LL, or
ffi.cast('long long', 123)). Note, this is important!

    vshard.router.bucket_id(123)
    vshard.router.bucket_id(123LL)
    vshard.router.bucket_id(123ULL)

    Return 3 different values!!!

For float and double cdata (ffi.cast('float', number),
ffi.cast('double', number)) these functions return different
values even for the same numbers of the same floating point type.
This is because tostring() on a floating point cdata number
returns not the number, but a pointer at it. Different on each
call.

vshard.router.bucket_id_strcrc32() behaves exactly the same.

vshard.router.bucket_id_mpcrc32() is safer. It takes a CRC32 from
MessagePack encoded value. That is, bucket_id of integers does not
depend on their Lua type. However it still may return different
values for not equal floating point types. That is,
ffi.cast('float', number) may be reflected onto a bucket id not
equal to ffi.cast('double', number). This can't be fixed, because
a float value, even being casted to double, may have a garbage
tail in its fraction.

Floating point keys should not be used to calculate a bucket id,
usually.

A final note - bucket_id_mpcrc32() in case of a string key does
not encode it into MessagePack, but takes hash right from the
string. This does not affect consistency of the function, but
makes it as fast as bucket_id_strcrc32().
@Gerold103
Copy link
Collaborator

Whoever is interested in participation of the new API and behaviour development: join https://lists.tarantool.org/pipermail/tarantool-patches/2020-February/014451.html thread.

Gerold103 added a commit that referenced this issue Feb 29, 2020
Closes #207

@TarantoolBot document
Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()
vshard.router.bucket_id() is deprecated, each its usage logs a
warning. It still works, but will be deleted in future.

Behaviour of the old bucket_id() function is now available as
vshard.router.bucket_id_strcrc32(). It works exactly like the old
function, but does not log a warning.

The reason why there is a new function bucket_id_mpcrc32() is that
the old bucket_id() and the new bucket_id_strcrc32() are not
consistent for cdata numbers. In particular, they return 3
different values for normal Lua numbers like 123, for unsigned
long long cdata (like 123ULL, or ffi.cast('unsigned long long',
123)), and for signed long long cdata (like 123LL, or
ffi.cast('long long', 123)). Note, this is important!

    vshard.router.bucket_id(123)
    vshard.router.bucket_id(123LL)
    vshard.router.bucket_id(123ULL)

    Return 3 different values!!!

For float and double cdata (ffi.cast('float', number),
ffi.cast('double', number)) these functions return different
values even for the same numbers of the same floating point type.
This is because tostring() on a floating point cdata number
returns not the number, but a pointer at it. Different on each
call.

vshard.router.bucket_id_strcrc32() behaves exactly the same, but
does not log a warning. In case you need that behaviour.

vshard.router.bucket_id_mpcrc32() is safer. It takes a CRC32 from
MessagePack encoded value. That is, bucket_id of integers does not
depend on their Lua type. However it still may return different
values for not equal floating point types. That is,
ffi.cast('float', number) may be reflected onto a bucket id not
equal to ffi.cast('double', number). This can't be fixed, because
a float value, even being casted to double, may have a garbage
tail in its fraction.

Floating point keys should not be used to calculate a bucket id,
usually.

P.S. #1: bucket_id_mpcrc32() in case of a string key does not
encode it into MessagePack, but takes hash right from the string.
This does not affect consistency of the function, but makes it as
fast as bucket_id_strcrc32().

P.S. #2: be very careful in case you store floating point types in
a space. When data is returned from a space, it is cased to Lua
number. And if that value had empty fraction part, it will be
treated as integer by bucket_id_mpcrc32(). So you need to do
explicit casts in such cases. Example of the problem:

s = box.schema.create_space('test', {format = {{'id', 'double'}}})
_ = s:create_index('pk')

inserted = ffi.cast('double', 1)

-- Value is stored as double.
s:replace({inserted})

-- But when returned to Lua, stored as Lua number, not cdata.
returned = s:get({inserted}).id
type(returned), returned
---
- number
- 1
...

vshard.router.bucket_id_mpcrc32(inserted)
---
- 1411
...
vshard.router.bucket_id_mpcrc32(returned)
---
- 1614
...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants