Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add java string hash, hive string hash, for compatible with java world #3811

Merged
merged 5 commits into from
Dec 18, 2018

Conversation

shangshujie365
Copy link
Contributor

@shangshujie365 shangshujie365 commented Dec 11, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

The hash function used by java, and hive 'distributed' clause, people can use these function to generate the same hash distribution as hive/java system after they loading some data already hashed by hive/java system, so, 'co-located join' / 'local in' can be used to speed up query.

Unfortunately, this function only works for ascii-string, which is '0x0' - '0x7F', because in java, char is 2 bytes, with 'unicode' encoding, for ascii char '0x01', its unicode value '0x0001' = '0x01', but for some chinese code,
we can't simply use it as 1 byte char in C. For some userId, or deviceId, it's enough, but you can't use it for a general java string.

Int32 h = 0;
for (int i = 0; i < (int)size; ++i)
{
h = 31 * h + data[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signed integer overflow is undefined behaviour in C++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can accumulate in unsigned integer and then cast to signed.

Copy link
Contributor Author

@shangshujie365 shangshujie365 Dec 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unsigned int and then cast, may get different result with java code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Int32 for accumulation with overflow is just illegal in C++ (in contrast to Java).

* the java string hash implement,
* many system from java world use this string hash function or based it
*/
class FunctionJavaHash : public IFunction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't use FunctionIntHash template?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunctionIntHash is a hash form 'int -> int', but JavaHash is 'string -> int'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't use FunctionAnyHash template?

static Int32 apply(const char * data, const size_t size)
{
Int32 h = 0;
for (int i = 0; i < (int)size; ++i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we use signed index?
And this code lack support for > 2GB strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In java, they use Int32 type, so, the hash value can less than 0, just for compatible

Copy link
Member

@alexey-milovidov alexey-milovidov Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the type of loop variable.

}
return h;
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it real, that Java still uses such a poor hash function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a common fast implementation for "==" operator in Java. If two objects has the same hashcode, then method object1.equals(object2) is called. E.g. this is a fast part of contract, that allows not to perform deep comparison of two complicated objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov yes, it's poor, but used wildly in java world

@@ -0,0 +1,4 @@
select JavaHash('abc');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have names starting with lowercase and with number of bits (javaHash32).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Int32 h = 0;
for (int i = 0; i < (int)size; ++i)
{
h = 31 * h + data[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using data[i] is incorrect, because signed-ness of char may be different on different platforms (x86, ARM).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to fix this problem? can we just simply transform data[i] to (int)data[i]?

@@ -956,6 +956,136 @@ class FunctionURLHash : public IFunction
}
};

struct JavaHashImpl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add a link for reference in comment.

@alexey-milovidov
Copy link
Member

I will try to address all my concerns and merge this PR.

@shangshujie365
Copy link
Contributor Author

I will try to address all my concerns and merge this PR.

That's cool

@alexey-milovidov alexey-milovidov merged commit c811170 into ClickHouse:master Dec 18, 2018
alexey-milovidov added a commit that referenced this pull request Dec 18, 2018
@alexey-milovidov
Copy link
Member

d5d1c34

alexey-milovidov added a commit that referenced this pull request Dec 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants