Consider using msgpack.encode instead of tostring in hash function #207
Comments
I totally agree, good catch, thanks for reporting it. Unfortunately, I can't change the existing …
Yes, I agree with you. Let's introduce a function that uses msgpack instead of tostring.
However, it is also worth considering that numbers in msgpack can be encoded differently.
Could you please give an example? Without explicit casts I assume it's safe:
tarantool> msgpack.encode(1)
---
- "\x01"
...
tarantool> msgpack.encode(1ULL)
---
- "\x01"
...
tarantool> msgpack.encode(1LL)
---
- "\x01"
...
tarantool> msgpack.encode(ffi.cast('double', 1))
---
- !!binary yz/wAAAAAAAA
...
tarantool> tostring(ffi.cast('double', 1))
---
- 'cdata<double>: 0x0102d885c8'
...
And without losing the precision of the number, obviously.
My plan was to call …
Anyone interested in participating in the development of the new API and behaviour: join the https://lists.tarantool.org/pipermail/tarantool-patches/2020-February/014451.html thread.
Closes #207

@TarantoolBot document
Title: vshard.router.bucket_id_strcrc32() and .bucket_id_mpcrc32()

vshard.router.bucket_id() is deprecated, and each use of it logs a warning. It still works, but will be deleted in the future.

The behaviour of the old bucket_id() function is now available as vshard.router.bucket_id_strcrc32(). It works exactly like the old function, but does not log a warning.

The reason for the new function bucket_id_mpcrc32() is that the old bucket_id() and the new bucket_id_strcrc32() are not consistent for cdata numbers. In particular, they return 3 different values for normal Lua numbers like 123, for unsigned long long cdata (like 123ULL, or ffi.cast('unsigned long long', 123)), and for signed long long cdata (like 123LL, or ffi.cast('long long', 123)).

Note, this is important!

    vshard.router.bucket_id(123)
    vshard.router.bucket_id(123LL)
    vshard.router.bucket_id(123ULL)

Return 3 different values!!!

For float and double cdata (ffi.cast('float', number), ffi.cast('double', number)) these functions return different values even for the same numbers of the same floating point type. This is because tostring() on a floating point cdata number returns not the number, but a pointer to it, which is different on each call.

vshard.router.bucket_id_strcrc32() behaves exactly the same, but does not log a warning, in case you need that behaviour.

vshard.router.bucket_id_mpcrc32() is safer. It takes a CRC32 of the MessagePack-encoded value. That is, the bucket_id of an integer does not depend on its Lua type. However, it may still return different values for different floating point types. That is, ffi.cast('float', number) may map to a bucket id not equal to that of ffi.cast('double', number). This can't be fixed, because a float value, even when cast to double, may have a garbage tail in its fraction.

Floating point keys should usually not be used to calculate a bucket id.

P.S. #1: for a string key, bucket_id_mpcrc32() does not encode it into MessagePack, but takes the hash directly from the string. This does not affect the consistency of the function, but makes it as fast as bucket_id_strcrc32().

P.S. #2: be very careful if you store floating point types in a space. When data is returned from a space, it is cast to a Lua number. If that value has an empty fraction part, bucket_id_mpcrc32() treats it as an integer. So you need to do explicit casts in such cases. Example of the problem:

    s = box.schema.create_space('test', {format = {{'id', 'double'}}})
    _ = s:create_index('pk')
    inserted = ffi.cast('double', 1)
    -- Value is stored as double.
    s:replace({inserted})
    -- But when returned to Lua, stored as Lua number, not cdata.
    returned = s:get({inserted}).id
    type(returned), returned
    ---
    - number
    - 1
    ...
    vshard.router.bucket_id_mpcrc32(inserted)
    ---
    - 1411
    ...
    vshard.router.bucket_id_mpcrc32(returned)
    ---
    - 1614
    ...
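The explicit cast recommended in P.S. #2 can be sketched as follows. This is a minimal illustration, assuming a script run by Tarantool with its built-in `ffi` and `msgpack` modules. It compares MessagePack encodings only (the input that bucket_id_mpcrc32() hashes, according to the description above) and does not call vshard or touch a space. The variable names are illustrative, and `returned` is a stand-in for a value read back from a 'double' field.

```lua
local ffi = require('ffi')
local msgpack = require('msgpack')

-- The key as it was inserted: an explicit double cdata.
local inserted = ffi.cast('double', 1)

-- Stand-in for the value read back from the space: a plain Lua number.
local returned = 1

-- A plain Lua number without a fraction is encoded as an integer,
-- a double cdata is encoded as a MessagePack double.
print(msgpack.encode(returned) == msgpack.encode(inserted))

-- Casting the returned value back to double restores the encoding
-- that was used when the key was inserted.
local recast = ffi.cast('double', returned)
print(msgpack.encode(recast) == msgpack.encode(inserted))
```

With the cast applied before hashing, the key read from the space and the key used on insert have the same encoding, so a hash computed over that encoding is the same for both.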
What's wrong with tostring:
Possible consequences:
Hashes calculated for "922337203685477580ULL" and "922337203685477580LL" are different. This could lead to unexpected results for users. (I understand that Tarantool internally uses msgpuck, which stores any positive integer as unsigned.)
Without breaking backward compatibility, I suggest introducing a new hash function that is stable for such cases, or at least documenting this behaviour.
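The difference described above can be reproduced with a short script. This is a sketch, not vshard's actual implementation: it assumes a script run by Tarantool with its built-in `digest` and `msgpack` modules, and the helper names `hash_via_tostring` and `hash_via_msgpack` are hypothetical. It compares raw CRC32 values and omits the mapping from a hash to a bucket id.

```lua
local digest = require('digest')
local msgpack = require('msgpack')

-- Hypothetical helper: hash the textual form of the key.
local function hash_via_tostring(key)
    return digest.crc32(tostring(key))
end

-- Hypothetical helper: hash the MessagePack encoding of the key.
local function hash_via_msgpack(key)
    return digest.crc32(msgpack.encode(key))
end

-- tostring() keeps the cdata suffix, so three different strings are hashed.
print(tostring(123), tostring(123LL), tostring(123ULL))
print(hash_via_tostring(123), hash_via_tostring(123LL), hash_via_tostring(123ULL))

-- MessagePack encodes all three as the same positive integer,
-- so the hashes are equal.
print(hash_via_msgpack(123), hash_via_msgpack(123LL), hash_via_msgpack(123ULL))
```

Hashing the encoded bytes rather than the printed form makes the result depend on the numeric value and not on the Lua type that carries it, which is the stability the proposal asks for.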