-
-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle ZWJ and emoji sequences, don't break identifiers within graphemes #372
Conversation
I'm not sure if we also need to update the Scheme parser as well? A similar patch for the Scheme parser would be diff --git a/src/flisp/julia_extensions.c b/src/flisp/julia_extensions.c
index f29e397275..fa04339b5d 100644
--- a/src/flisp/julia_extensions.c
+++ b/src/flisp/julia_extensions.c
@@ -178,6 +178,7 @@ JL_DLLEXPORT int jl_op_suffix_char(uint32_t wc)
static int never_id_char(uint32_t wc)
{
utf8proc_category_t cat = utf8proc_category((utf8proc_int32_t) wc);
+ if (wc == 0x200c || wc == 0x200d) return 0; // ZWNJ/ZWJ is allowed in emoji, despite being in category Cf
return (
// spaces and control characters:
(cat >= UTF8PROC_CATEGORY_ZS && cat <= UTF8PROC_CATEGORY_CS) ||
@@ -329,6 +330,8 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
if (!iscprim(args[0]) || ((cprim_t*)ptr(args[0]))->type != fl_ctx->wchartype)
type_error(fl_ctx, "accum-julia-symbol", "wchar", args[0]);
uint32_t wc = *(uint32_t*)cp_data((cprim_t*)ptr(args[0])); // peek the first character we'll read
+ uint32_t prev_wc = wc;
+ utf8proc_int32_t grapheme_state = 0;
ios_t str;
int allascii = 1;
ios_mem(&str, 0);
@@ -345,9 +348,10 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
}
allascii &= (wc <= 0x7f);
ios_pututf8(&str, wc);
+ prev_wc = wc;
if (safe_peekutf8(fl_ctx, s, &wc) == IOS_EOF)
break;
- } while (jl_id_char(wc));
+ } while (jl_id_char(wc) || !utf8proc_grapheme_break_stateful(prev_wc, wc, &grapheme_state));
ios_pututf8(&str, 0);
return symbol(fl_ctx, allascii ? str.buf : normalize(fl_ctx, str.buf));
} although this doesn't implement the rule of forbidding ZWJ at the end of an identifier. |
Cool this looks sensible. I worry about the extra cost of calling into the C code per character for such a relatively hot (presumably) code path. What does the script in test/benchmark.jl say about performance impact? Personally I'm not worried about keeping the scheme parser perfectly up to date. Primarily it only needs to be able to parse Base until we figure out the bootstrapping story and can eventually delete it. (Side note... we really should bring |
Codecov Report
@@ Coverage Diff @@
## main #372 +/- ##
==========================================
- Coverage 96.58% 96.57% -0.01%
==========================================
Files 14 14
Lines 4160 4179 +19
==========================================
+ Hits 4018 4036 +18
- Misses 142 143 +1
|
Here are the results of
It looks like a 15% slowdown on ParseStream, a 10% slowdown on |
As a more realistic data point, here is the cost to parse all of base (Julia 1.9.2) with the following code: using JuliaSyntax, BenchmarkTools
include(dirname(pathof(JuliaSyntax)) * "/../test/test_utils.jl")
function bench_parse_all_in_path(basedir)
for filepath in find_source_in_path(basedir)
text = read(filepath, String)
stream = ParseStream(text)
parse!(stream)
ex = build_tree(Expr, stream, filename=filepath)
end
end
p = joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base")
p = isdir(p) ? p : joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "src", "base")
@btime bench_parse_all_in_path($p) which gives:
My conclusion is that the slowdown in realistic workloads will be negligible, especially since parsing is usually only a very small portion of load time. |
I don't love these slowdowns - we could live with them of course, but it's very significant for such a small change and it does show this is a relatively hot code path (no surprise I guess). Luckily there's probably a super simple optimization: if you throw in an |
Another thing - while 3% is somewhat insignificant in building |
Just hardcore the answer for ascii characters? That feels like is should solve 99% of the perf issues. |
Following the suggestion by @c42f and @Keno, I added an optimization for the common case of ASCII identifiers. The minimum times are now (on a different computer than above):
Voilá, no performance penalty! (It's actually very slighly faster, since we no longer call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for sorting out the performance hit (and in fact, making the performance better!) I feel like this could benefit from another couple of test cases but other than that looks great to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes we do test other invisible chars elsewhere, thanks for the reminder 😅
Merge when ready :)
I suppose a |
Closes JuliaLang/julia#40071 — never break an identifier within a grapheme boundary. (Identifiers and graphemes may still only start within the pre-existing allowed set of characters.)
This allows us to accept identifiers like
🏳️🌈
==🏳️[ZWJ]🌈
which contain the U+200D "Zero-width joiner" (ZWJ) character, which was not previously allowed since it's in category Cf (Other, format). However, to prevent incomplete graphemes ending with an invisible ZWJ character, we disallow ZWJ at the end of an identifier; similarly for U+200C "Zero-width non-joiner" (ZWNJ).