[jvm] Pattern matching involving unicode #10720

Closed
kevinresol opened this issue Jun 3, 2022 · 4 comments

@kevinresol
Contributor

Haxe v4.2.5

function foo() return '😀 😀';

switch (foo()) {
	case '😀 😀':
		trace('yo');
	case v:
		trace('meh');
}

This prints "yo" on Node.js and "meh" on the JVM target.

@Simn Simn self-assigned this Jun 3, 2022
@Simn Simn added the platform-jvm Everything related to JVM label Jun 3, 2022
@Simn
Member

Simn commented Jun 3, 2022

This might be about the hashing being used. IIRC we determine a hash at compile-time and compare it to the one from run-time.
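To illustrate how a compile-time hash and a run-time hash could disagree, here is a minimal Java sketch. It is not the code Haxe generates: `codePointHash` is a hypothetical stand-in for a compile-time hash that walks Unicode code points, and `dispatch` is a hypothetical model of a hash-based switch. The run-time side uses the real `String.hashCode()`, which works on UTF-16 code units.

```java
// Sketch only: models hash-based string dispatch, not Haxe's actual output.
final class HashSwitchSketch {
    // Hypothetical "compile-time" hash that iterates over Unicode code points.
    static int codePointHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            h = 31 * h + cp;
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return h;
    }

    // Compare the run-time hash against a precomputed constant,
    // then confirm the match with equals().
    static String dispatch(String value, String pattern, int precomputedHash) {
        if (value.hashCode() == precomputedHash && value.equals(pattern)) {
            return "yo";
        }
        return "meh";
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00 \uD83D\uDE00"; // two U+1F600 separated by a space
        // BMP-only text: both hashes agree, so the case is taken.
        System.out.println(dispatch("abc", "abc", codePointHash("abc"))); // yo
        // Supplementary code points: the hashes differ, so the case is skipped.
        System.out.println(dispatch(emoji, emoji, codePointHash(emoji))); // meh
    }
}
```

Under that assumption, strings made only of code points up to U+FFFF hash identically on both sides, which would explain why the mismatch only shows up with emoji.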

@kevinresol
Contributor Author

Java treats each emoji as having length 2, so I guess that's where the issue comes from:

function foo() return '😀 😀';
function bar() return 'abc';
function baz() return '名 字';

function main() {
	trace(hashCode(foo()), ((cast foo():java.lang.Object).hashCode()), foo().length);
	trace(hashCode(bar()), ((cast bar():java.lang.Object).hashCode()), bar().length);
	trace(hashCode(baz()), ((cast baz():java.lang.Object).hashCode()), baz().length);
}

function hashCode(value:String) {
	var h = 0;
	
	if(value.length > 0) {
		for(i in 0...value.length) {
			h = 31 * h + value.charCodeAt(i);
		}
	}
	
	return h;
}

prints:

src/Main.hx:8: 1278630208, 1278630208, 5
src/Main.hx:9: 96354, 96354, 3
src/Main.hx:10: 20702212, 20702212, 3
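The length of 5 can be reproduced in plain Java, where `String.length()` counts UTF-16 code units rather than code points. The sketch below uses a hypothetical helper, `describe`, and only standard `java.lang.String` methods. The emoji is written with `\u` escapes so the result does not depend on source file encoding.

```java
// Sketch: shows that Java strings are measured and hashed in UTF-16 code units.
final class Utf16LengthSketch {
    // Returns {UTF-16 length, code point count, String.hashCode()}.
    static int[] describe(String s) {
        return new int[] {
            s.length(),                      // counts UTF-16 code units
            s.codePointCount(0, s.length()), // counts Unicode code points
            s.hashCode()                     // hashes UTF-16 code units
        };
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00 \uD83D\uDE00"; // two U+1F600 separated by a space
        int[] d = describe(emoji);
        System.out.println(d[0]); // 5
        System.out.println(d[1]); // 3
        System.out.println(d[2]); // 1278630208
    }
}
```

Each U+1F600 occupies two code units (`0xD83D`, `0xDE00`), so the string has five code units but only three code points.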

@Aurel300
Member

Aurel300 commented Jun 4, 2022

Java uses a modified UTF-8 encoding (for example, for string constants in class files), described here. In particular, code points above U+FFFF are encoded as two surrogate code units, each taking three bytes, instead of the standard 4-byte UTF-8 sequence representing a single code point.
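The difference between the two encodings can be observed with `DataOutputStream.writeUTF`, which is documented to write modified UTF-8 after a 2-byte length prefix. This is a small sketch; the class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: compares modified UTF-8 with standard UTF-8 for one emoji.
final class ModifiedUtf8Sketch {
    // Encodes s as modified UTF-8, dropping writeUTF's 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes s as standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String grin = "\uD83D\uDE00"; // U+1F600
        System.out.println(modifiedUtf8(grin).length); // 6: two 3-byte surrogates
        System.out.println(standardUtf8(grin).length); // 4: one 4-byte sequence
    }
}
```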

@Simn Simn closed this as completed in e239cf0 Jun 7, 2022
@Simn
Member

Simn commented Jun 7, 2022

Fortunately, I found a java_hash implementation in genjava which deals with this stuff.
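The `java_hash` code itself is not shown in this thread, so the following is only a sketch of the general idea, not the genjava implementation. It assumes the compiler side sees the string as Unicode code points and wants a hash equal to Java's `String.hashCode()`. The hypothetical `javaHashFromCodePoints` does this by splitting every code point above U+FFFF into its two UTF-16 surrogates before hashing.

```java
// Sketch only: a surrogate-aware hash over code points; names are illustrative.
final class SurrogateAwareHashSketch {
    // Computes the value String.hashCode() would return for the given code points.
    static int javaHashFromCodePoints(int[] codePoints) {
        int h = 0;
        for (int cp : codePoints) {
            if (cp >= 0x10000) {
                int v = cp - 0x10000;
                h = 31 * h + (0xD800 | (v >> 10));   // high surrogate
                h = 31 * h + (0xDC00 | (v & 0x3FF)); // low surrogate
            } else {
                h = 31 * h + cp;                     // BMP code point, one code unit
            }
        }
        return h;
    }

    public static void main(String[] args) {
        int[] cps = {0x1F600, 0x20, 0x1F600};
        String emoji = "\uD83D\uDE00 \uD83D\uDE00"; // same text as cps
        System.out.println(javaHashFromCodePoints(cps) == emoji.hashCode()); // true
    }
}
```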
