-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add java string hash, hive string hash, for compatible with java world #3811
Conversation
Int32 h = 0; | ||
for (int i = 0; i < (int)size; ++i) | ||
{ | ||
h = 31 * h + data[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signed integer overflow is undefined behaviour in C++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can accumulate in unsigned integer and then cast to signed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using unsigned int and then cast, may get different result with java code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Int32 for accumulation with overflow is just illegal in C++ (in contrast to Java).
* the java string hash implement, | ||
* many system from java world use this string hash function or based it | ||
*/ | ||
class FunctionJavaHash : public IFunction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't use FunctionIntHash template?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FunctionIntHash is a hash form 'int -> int', but JavaHash is 'string -> int'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't use FunctionAnyHash template?
static Int32 apply(const char * data, const size_t size) | ||
{ | ||
Int32 h = 0; | ||
for (int i = 0; i < (int)size; ++i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we use signed index?
And this code lack support for > 2GB strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In java, they use Int32 type, so, the hash value can less than 0, just for compatible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the type of loop variable.
} | ||
return h; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it real, that Java still uses such a poor hash function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a common fast implementation for "==" operator in Java. If two objects has the same hashcode, then method object1.equals(object2) is called. E.g. this is a fast part of contract, that allows not to perform deep comparison of two complicated objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexey-milovidov yes, it's poor, but used wildly in java world
@@ -0,0 +1,4 @@ | |||
select JavaHash('abc'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to have names starting with lowercase and with number of bits (javaHash32
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
Int32 h = 0; | ||
for (int i = 0; i < (int)size; ++i) | ||
{ | ||
h = 31 * h + data[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using data[i]
is incorrect, because signed-ness of char
may be different on different platforms (x86, ARM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how to fix this problem? can we just simply transform data[i] to (int)data[i]?
@@ -956,6 +956,136 @@ class FunctionURLHash : public IFunction | |||
} | |||
}; | |||
|
|||
struct JavaHashImpl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add a link for reference in comment.
I will try to address all my concerns and merge this PR. |
That's cool |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
The hash function used by java, and hive 'distributed' clause, people can use these function to generate the same hash distribution as hive/java system after they loading some data already hashed by hive/java system, so, 'co-located join' / 'local in' can be used to speed up query.
Unfortunately, this function only works for ascii-string, which is '0x0' - '0x7F', because in java, char is 2 bytes, with 'unicode' encoding, for ascii char '0x01', its unicode value '0x0001' = '0x01', but for some chinese code,
we can't simply use it as 1 byte char in C. For some userId, or deviceId, it's enough, but you can't use it for a general java string.