handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

stevengj · 2023-10-24T14:24:21Z

Closes JuliaLang/julia#40071 — never break an identifier within a grapheme boundary. (Identifiers and graphemes may still only start within the pre-existing allowed set of characters.)

This allows us to accept identifiers like 🏳️‍🌈 == 🏳️‍[ZWJ]🌈 which contain the U+200D "Zero-width joiner" (ZWJ) character, which was not previously allowed since it's in category Cf (Other, format). However, to prevent incomplete graphemes ending with an invisible ZWJ character, we disallow ZWJ at the end of an identifier; similarly for U+200C "Zero-width non-joiner" (ZWNJ).

stevengj · 2023-10-24T14:25:26Z

I'm not sure if we also need to update the Scheme parser as well? A similar patch for the Scheme parser would be

diff --git a/src/flisp/julia_extensions.c b/src/flisp/julia_extensions.c
index f29e397275..fa04339b5d 100644
--- a/src/flisp/julia_extensions.c
+++ b/src/flisp/julia_extensions.c
@@ -178,6 +178,7 @@ JL_DLLEXPORT int jl_op_suffix_char(uint32_t wc)
 static int never_id_char(uint32_t wc)
 {
      utf8proc_category_t cat = utf8proc_category((utf8proc_int32_t) wc);
+     if (wc == 0x200c || wc == 0x200d) return 0; // ZWNJ/ZWJ is allowed in emoji, despite being in category Cf
      return (
           // spaces and control characters:
           (cat >= UTF8PROC_CATEGORY_ZS && cat <= UTF8PROC_CATEGORY_CS) ||
@@ -329,6 +330,8 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
     if (!iscprim(args[0]) || ((cprim_t*)ptr(args[0]))->type != fl_ctx->wchartype)
         type_error(fl_ctx, "accum-julia-symbol", "wchar", args[0]);
     uint32_t wc = *(uint32_t*)cp_data((cprim_t*)ptr(args[0])); // peek the first character we'll read
+    uint32_t prev_wc = wc;
+    utf8proc_int32_t grapheme_state = 0;
     ios_t str;
     int allascii = 1;
     ios_mem(&str, 0);
@@ -345,9 +348,10 @@ value_t fl_accum_julia_symbol(fl_context_t *fl_ctx, value_t *args, uint32_t narg
         }
         allascii &= (wc <= 0x7f);
         ios_pututf8(&str, wc);
+        prev_wc = wc;
         if (safe_peekutf8(fl_ctx, s, &wc) == IOS_EOF)
             break;
-    } while (jl_id_char(wc));
+    } while (jl_id_char(wc) || !utf8proc_grapheme_break_stateful(prev_wc, wc, &grapheme_state));
     ios_pututf8(&str, 0);
     return symbol(fl_ctx, allascii ? str.buf : normalize(fl_ctx, str.buf));
 }

although this doesn't implement the rule of forbidding ZWJ at the end of an identifier.

c42f · 2023-10-26T04:19:44Z

Cool this looks sensible. I worry about the extra cost of calling into the C code per character for such a relatively hot (presumably) code path. What does the script in test/benchmark.jl say about performance impact?

Personally I'm not worried about keeping the scheme parser perfectly up to date. Primarily it only needs to be able to parse Base until we figure out the bootstrapping story and can eventually delete it.

(Side note... we really should bring Base.is_id_char() into JuliaSyntax at some point --- this library does aim to be installable as a normal package on both old and new versions of Julia and behave the same everywhere...)

codecov · 2023-10-26T12:53:11Z

Codecov Report

Merging #372 (75a1fc2) into main (1b048aa) will decrease coverage by 0.01%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #372      +/-   ##
==========================================
- Coverage   96.58%   96.57%   -0.01%     
==========================================
  Files          14       14              
  Lines        4160     4179      +19     
==========================================
+ Hits         4018     4036      +18     
- Misses        142      143       +1

Files	Coverage Δ
src/tokenize.jl	`99.08% <100.00%> (+0.01%)`	⬆️

... and 2 files with indirect coverage changes

stevengj · 2023-10-26T13:05:33Z

Here are the results of test/benchmark.jl:

main branch, minimum times:

ParseStream: 205.944 ms
GreenNode: 214.803 ms
SyntaxNode: 403.870 ms
Expr: 472.283 ms
flisp: 3.644 s

sgj/ZWJ branch:

ParseStream: 179.254 ms
GreenNode: 241.707 ms
SyntaxNode: 427.242 ms
Expr: 496.298 ms
flisp: 3.652 s

It looks like a 15% slowdown on ParseStream, a 10% slowdown on GreenNode, and a 5% slowdown on SyntaxNode?

stevengj · 2023-10-26T13:43:33Z

As a more realistic data point, here is the cost to parse all of base (Julia 1.9.2) with the following code:

using JuliaSyntax, BenchmarkTools
include(dirname(pathof(JuliaSyntax)) * "/../test/test_utils.jl")

function bench_parse_all_in_path(basedir)
    for filepath in find_source_in_path(basedir)
        text = read(filepath, String)
        stream = ParseStream(text)
        parse!(stream)
        ex = build_tree(Expr, stream, filename=filepath)
    end
end

p = joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base")
p = isdir(p) ? p : joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "src", "base") 

@btime bench_parse_all_in_path($p)

which gives:

main branch: 543.924 ms (2117659 allocations: 139.46 MiB)
sgj/ZWJ branch: 558.836 ms (2719985 allocations: 148.77 MiB)
i.e. a slowdown of < 3%.

My conclusion is that the slowdown in realistic workloads will be negligible, especially since parsing is usually only a very small portion of load time.

c42f · 2023-10-27T12:34:28Z

I don't love these slowdowns - we could live with them of course, but it's very significant for such a small change and it does show this is a relatively hot code path (no surprise I guess).

Luckily there's probably a super simple optimization: if you throw in an isascii() before you hit the slow path of isgraphemebreak!() it's probably almost free for most code?

c42f · 2023-10-27T12:36:42Z

Another thing - while 3% is somewhat insignificant in building Base, and the slowdown for GreenNode and SyntaxNode may seem irrelevant it's not exactly: it's quite relevant to surface-level tooling which works across a whole registry - for example package analyzer.

Keno · 2023-10-27T16:20:14Z

Just hardcore the answer for ascii characters? That feels like is should solve 99% of the perf issues.

stevengj · 2023-10-27T20:57:31Z

Following the suggestion by @c42f and @Keno, I added an optimization for the common case of ASCII identifiers. The minimum times are now (on a different computer than above):

sgj/ZWJ branch:

ParseStream: 114.662 ms ms
GreenNode: 134.884 ms
SyntaxNode: 247.248 ms
Expr: 300.598 ms
flisp: 2.306 s

main branch:

ParseStream: 115.894 ms
GreenNode: 136.842 ms
SyntaxNode: 252.208 ms
Expr: 304.795 ms
flisp: 2.342 s

Voilá, no performance penalty! (It's actually very slighly faster, since we no longer call is_identifier_char at all for ASCII chars, avoiding a lookup of the Unicode category.)

c42f

Awesome, thanks for sorting out the performance hit (and in fact, making the performance better!) I feel like this could benefit from another couple of test cases but other than that looks great to me.

test/tokenize.jl

src/tokenize.jl

c42f

Oh yes we do test other invisible chars elsewhere, thanks for the reminder 😅

Merge when ready :)

stevengj · 2023-11-01T12:27:40Z

I suppose a julia/NEWS.md item will be added when the JuliaSyntax version is bumped in the julia repo? Is there a place to make a note of that in the meantime?

handle ZWJ and emoji sequences

4859a29

stevengj added enhancement New feature or request tokenizer speculative labels Oct 24, 2023

forbid ZWNJ at end

bfcac85

fix tests on Julia < 1.5

d22e2c7

ascii fast path

584d363

stevengj force-pushed the sgj/ZWJ branch from 58f99a5 to 584d363 Compare October 27, 2023 20:50

fix for earlier Julia versions

8f6dc09

c42f reviewed Oct 31, 2023

View reviewed changes

test/tokenize.jl Outdated Show resolved Hide resolved

src/tokenize.jl Show resolved Hide resolved

Update test/tokenize.jl

75a1fc2

c42f approved these changes Nov 1, 2023

View reviewed changes

c42f mentioned this pull request Nov 1, 2023

EOF_CHAR isn't the common case, moving those first #365

Open

stevengj merged commit 05c594b into main Nov 1, 2023
32 checks passed

stevengj deleted the sgj/ZWJ branch November 1, 2023 12:26

stevengj mentioned this pull request Nov 1, 2023

eliminate dependency on Base.is_id_char and julia-specific utf8proc #377

Open

c42f mentioned this pull request Jul 23, 2024

Release 1.0.0 #472

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

stevengj commented Oct 24, 2023 •

edited

Loading

stevengj commented Oct 24, 2023 •

edited

Loading

c42f commented Oct 26, 2023 •

edited

Loading

codecov bot commented Oct 26, 2023 •

edited

Loading

stevengj commented Oct 26, 2023 •

edited

Loading

stevengj commented Oct 26, 2023 •

edited

Loading

c42f commented Oct 27, 2023

c42f commented Oct 27, 2023 •

edited

Loading

Keno commented Oct 27, 2023

stevengj commented Oct 27, 2023 •

edited

Loading

c42f left a comment

c42f left a comment

stevengj commented Nov 1, 2023 •

edited

Loading

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

handle ZWJ and emoji sequences, don't break identifiers within graphemes #372

Conversation

stevengj commented Oct 24, 2023 • edited Loading

stevengj commented Oct 24, 2023 • edited Loading

c42f commented Oct 26, 2023 • edited Loading

codecov bot commented Oct 26, 2023 • edited Loading

Codecov Report

stevengj commented Oct 26, 2023 • edited Loading

stevengj commented Oct 26, 2023 • edited Loading

c42f commented Oct 27, 2023

c42f commented Oct 27, 2023 • edited Loading

Keno commented Oct 27, 2023

stevengj commented Oct 27, 2023 • edited Loading

c42f left a comment

Choose a reason for hiding this comment

c42f left a comment

Choose a reason for hiding this comment

stevengj commented Nov 1, 2023 • edited Loading

stevengj commented Oct 24, 2023 •

edited

Loading

stevengj commented Oct 24, 2023 •

edited

Loading

c42f commented Oct 26, 2023 •

edited

Loading

codecov bot commented Oct 26, 2023 •

edited

Loading

stevengj commented Oct 26, 2023 •

edited

Loading

stevengj commented Oct 26, 2023 •

edited

Loading

c42f commented Oct 27, 2023 •

edited

Loading

stevengj commented Oct 27, 2023 •

edited

Loading

stevengj commented Nov 1, 2023 •

edited

Loading