Add native UTF-8 Validation using fast shift based DFA #47880

ndinsmore · 2022-12-12T22:29:01Z

This is a based on the discussion in #41533. It is a julia implementation of a shift based DFA implemented with inspiration from golang/go#47120.

*** Edit the benchmarks have been updated using the code found in ndinsmore/StringBenchmarks***

Throughput improvement: small strings -> large strings

ASCII: 1.4x -> 20x
single nonASCII: 1.1x -> 10x
Unicode: 1.2x -> 2.4x

Master:

"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.873 ms)
	  "length 512:4096" => Trial(636.962 μs)
	  "length 64:512" => Trial(733.949 μs)
	  "length 4096:32768" => Trial(585.777 μs)
  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "files" => Trial(4.258 ms)
	  "lines" => Trial(7.950 ms)
	  "files SubString" => Trial(4.094 ms)
	  "lines SubString" => Trial(7.714 ms)
  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.991 ms)
	  "length 512:4096" => Trial(1.588 ms)
	  "length 64:512" => Trial(1.658 ms)
	  "length 4096:32768" => Trial(1.561 ms)
  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.123 ms)
	  "length 512:4096" => Trial(703.426 μs)
	  "length 64:512" => Trial(750.386 μs)
	  "length 4096:32768" => Trial(730.985 μs)

This PR:

  "isvalid" =>4-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.757 ms)
	    "length 512:4096" => Trial(164.487 μs)
	    "length 64:512" => Trial(659.963 μs)
	    "length 4096:32768" => Trial(62.674 μs)
    "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "files" => Trial(1.025 ms)
	    "lines" => Trial(5.355 ms)
	    "files SubString" => Trial(870.075 μs)
	    "lines SubString" => Trial(5.262 ms)
    "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.610 ms)
	    "length 512:4096" => Trial(671.044 μs)
	    "length 64:512" => Trial(840.627 μs)
	    "length 4096:32768" => Trial(654.662 μs)
    "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(794.437 μs)
	    "length 512:4096" => Trial(55.955 μs)
	    "length 64:512" => Trial(177.169 μs)
	    "length 4096:32768" => Trial(37.669 μs)

base/strings/string.jl

ndinsmore · 2022-12-13T00:58:02Z

Right now this isn't working, working to resolve test failures

ndinsmore · 2022-12-13T05:21:39Z

Just a further update, it looks like the state table from the reference I had is wrong. I am working on rebuilding it should be ready tomorrow.

ndinsmore · 2022-12-13T17:47:25Z

This is now passing all test on my system I don't know why buildkite is failing, could someone relaunch the test for me?

gbaraldi · 2022-12-13T17:49:47Z

Github is broken :)

base/strings/string.jl

ndinsmore · 2022-12-13T18:59:15Z

Is the PR the right place to document the build process for the state machine table?

oscardssmith · 2022-12-13T19:00:00Z

I'd put the table information in a comment in the code.

ndinsmore · 2022-12-13T23:54:30Z

I think with the final documentation of the methodology it is ready to go.

base/strings/string.jl

ndinsmore · 2023-01-06T22:10:21Z

@stevengj & @StefanKarpinski (sorry to drag you into this, but I think you did a lot of the initial strings work)
Some of the requested changes(which are 100% correct) beg the question whether this should really be using The codeunits interface rather than unsafe_wrap.

Initial testing seemed to indicate that CodeUnits is slower, but I wonder if it would be wise to do it the correct way here. Then work on speeding up CodeUnits array.

StefanKarpinski · 2023-01-06T22:31:25Z

Happy to be pulled in! It certainly seems cleaner to use codeunits. @JeffBezanson, any idea why that approach would be less efficient?

ndinsmore · 2023-01-10T16:24:22Z

The implementation has changed to using CodeUnits instead of unsafe_wrap now. Once the GC.@preserve had been added to the unsafe_wrap calls, the benchmarks showed that CodeUnits was now faster and on par with the benchmarks before the GC changes were made.

ndinsmore · 2023-01-11T17:32:23Z

@stevengj, @StefanKarpinski & @oscardssmith
There have been a good amount of changes since the initial PR, but I think it is ready to go.

I have reversed or order of operations on the shift dfa which leads to state that is never dirty (ie. you would have to state & UInt64(63) before use) and provided a performance improvement.

Also the use of codeunits has dramatically cleaned up the code and made the implementation more maintainable.

With this PR the processing rate is about 0.5 bytes / cycle for short strings (word to line length) and 1.5 bytes/cycle for longer strings ( file length). This is on par (short) to better (long) than the c implementation.

There are two PRs I plan to follow up with:

Broader use of the DFA throughout strings.jl which should simplify the implementation of length,iterate, nextind, and basically anything else that parses through the utf-8 bytes.
A heuristical to approach to short vs long strings. The current implementation can likely improve short strings 50%, but for long strings, it is fairly easy to get to 10 bytes/cycle for utf-8 and even better 50 bytes/cycle for ascii only strings.

Let me know if there is anything else that needs to be done to this.

base/strings/substring.jl

ndinsmore · 2023-01-23T17:36:39Z

Everything seems ready to go.

StefanKarpinski · 2023-01-27T22:39:35Z

This is great work. Now that the implementation has moved to Julia, I think we need much more comprehensive tests of the functionality. Previously we could just assume that byte_string_classify was correct and simply verify that we use it correctly—basically try one ASCII, one non-ASCII UTF-8 and one invalid string. Now that the implementation is here, we need to have comprehensive tests for it, ideally ones that test all the edge cases of this algorithm, which you're in the best position at this point to understand.

ndinsmore · 2023-02-06T13:15:21Z

Now that everything goes the the DFA would you agree that if the tests validate the DFA, we can assume byte_string_classify is validated?

ndinsmore · 2023-02-06T16:25:23Z

Test have been added to validate that the DFA state machine returns state as expected and per the Unicode spec.

ndinsmore · 2023-04-06T14:56:26Z

@mkitti The rebase is done at this point, but thank you.

Does anyone know why these tests are failing?

mkitti · 2023-04-06T15:00:01Z

I'm guessing the test failures is unrelated. Could be a problem on master or maybe CI.

@oscardssmith do we want to wait until master can pass tests, or is this mergeable now?

oscardssmith · 2023-04-06T15:10:57Z

They look unrelated. I'm happy to merge as is if no one objects in the next day or two.

ndinsmore · 2023-04-06T15:29:54Z

@mkitti & @oscardssmith this is a little off topic, but when I rebase I normally just merge master to my local master and rebase to that. I have had bad luck recently in picking masters that seem to fail tests. So the question is there a tagged master which is always passing all tests?

oscardssmith · 2023-04-06T15:48:03Z

not really. in theory, master is supposed to pass tests, but in practice that sometimes doesn't happen.

mkitti · 2023-04-06T22:01:50Z

Master does seem to be passing again... we should we rerun the failing test?

base/strings/string.jl

Co-authored-by: Steven G. Johnson <[email protected]>

base/strings/string.jl

…nto native_utf8_validation

ndinsmore · 2023-04-11T15:58:31Z

Is the REPL failure related to this? I am sure it is text heavy but I don't see any of the changed functions in this PR returning a Union?

oscardssmith · 2023-04-11T16:10:45Z

I'm slightly scared by this so I'm re-running CI.

ndinsmore · 2023-04-12T16:00:54Z

Looks like it cleared

StefanKarpinski · 2023-04-12T21:07:54Z

Exciting times! Was considering how to review this, but it's well tested and I think we'll notice if string stuff is broken.

* Working Native UTF-8 Validation --------- Co-authored-by: Oscar Smith <[email protected]> Co-authored-by: Steven G. Johnson <[email protected]>

oscardssmith reviewed Dec 12, 2022

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

oscardssmith reviewed Dec 13, 2022

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

oscardssmith reviewed Dec 13, 2022

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

sclel016 reviewed Dec 14, 2022

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

ndinsmore mentioned this pull request Dec 15, 2022

Use shift based DFA for utf8 validation #41533

Closed

giordano reviewed Dec 15, 2022

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

brenhinkeller added strings "Strings!" performance Must go faster unicode Related to unicode characters and encodings labels Dec 18, 2022

stevengj reviewed Jan 4, 2023

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

stevengj reviewed Jan 4, 2023

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

stevengj reviewed Jan 4, 2023

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

oscardssmith reviewed Jan 11, 2023

View reviewed changes

base/strings/substring.jl Outdated Show resolved Hide resolved

StefanKarpinski added the needs tests Unit tests are required for this change label Jan 27, 2023

thomcc mentioned this pull request Feb 7, 2023

Rewrite libcore's UTF-8 validation for performance rust-lang/rust#107760

Closed

5 tasks

ndinsmore added 7 commits April 6, 2023 09:39

Changed DFA to track isascii & added state diagram

b446597

Fix states discription

b989d0d

Add tests to validate DFA

2795250

Trailing newline

bc0e662

State to UInt32 & use SMTsolver derived shifts

557bda6

Add Chunk based byte_string_classify

aaf8863

Chunk based processing

8006d60

ndinsmore force-pushed the native_utf8_validation branch from 935c46d to 8006d60 Compare April 6, 2023 13:40

Merge branch 'master' into native_utf8_validation

cd467dd

stevengj reviewed Apr 7, 2023

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

Update base/strings/string.jl

a6e3383

Co-authored-by: Steven G. Johnson <[email protected]>

oscardssmith reviewed Apr 7, 2023

View reviewed changes

base/strings/string.jl Outdated Show resolved Hide resolved

ndinsmore and others added 3 commits April 10, 2023 11:49

Change State Arrays to a matrix

d1a129c

Merge branch 'native_utf8_validation' of github.com:ndinsmore/julia i…

331f05d

…nto native_utf8_validation

Merge branch 'master' into native_utf8_validation

661ec93

Merge branch 'master' into native_utf8_validation

d456837

oscardssmith merged commit b554e8f into JuliaLang:master Apr 12, 2023

This was referenced Jun 23, 2023

MethodError: isvalid(::String31) JuliaStrings/InlineStrings.jl#60

Closed

use native isvalid in 1.10 JuliaStrings/StringViews.jl#20

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add native UTF-8 Validation using fast shift based DFA #47880

Add native UTF-8 Validation using fast shift based DFA #47880

ndinsmore commented Dec 12, 2022 •

edited

Loading

ndinsmore commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

gbaraldi commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

oscardssmith commented Dec 13, 2022

ndinsmore commented Dec 13, 2022 •

edited

Loading

ndinsmore commented Jan 6, 2023

StefanKarpinski commented Jan 6, 2023

ndinsmore commented Jan 10, 2023

ndinsmore commented Jan 11, 2023

ndinsmore commented Jan 23, 2023

StefanKarpinski commented Jan 27, 2023

ndinsmore commented Feb 6, 2023

ndinsmore commented Feb 6, 2023

ndinsmore commented Apr 6, 2023

mkitti commented Apr 6, 2023 •

edited

Loading

oscardssmith commented Apr 6, 2023

ndinsmore commented Apr 6, 2023

oscardssmith commented Apr 6, 2023

mkitti commented Apr 6, 2023

ndinsmore commented Apr 11, 2023

oscardssmith commented Apr 11, 2023

ndinsmore commented Apr 12, 2023

StefanKarpinski commented Apr 12, 2023

Add native UTF-8 Validation using fast shift based DFA #47880

Add native UTF-8 Validation using fast shift based DFA #47880

Conversation

ndinsmore commented Dec 12, 2022 • edited Loading

ndinsmore commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

gbaraldi commented Dec 13, 2022

ndinsmore commented Dec 13, 2022

oscardssmith commented Dec 13, 2022

ndinsmore commented Dec 13, 2022 • edited Loading

ndinsmore commented Jan 6, 2023

StefanKarpinski commented Jan 6, 2023

ndinsmore commented Jan 10, 2023

ndinsmore commented Jan 11, 2023

ndinsmore commented Jan 23, 2023

StefanKarpinski commented Jan 27, 2023

ndinsmore commented Feb 6, 2023

ndinsmore commented Feb 6, 2023

ndinsmore commented Apr 6, 2023

mkitti commented Apr 6, 2023 • edited Loading

oscardssmith commented Apr 6, 2023

ndinsmore commented Apr 6, 2023

oscardssmith commented Apr 6, 2023

mkitti commented Apr 6, 2023

ndinsmore commented Apr 11, 2023

oscardssmith commented Apr 11, 2023

ndinsmore commented Apr 12, 2023

StefanKarpinski commented Apr 12, 2023

ndinsmore commented Dec 12, 2022 •

edited

Loading

ndinsmore commented Dec 13, 2022 •

edited

Loading

mkitti commented Apr 6, 2023 •

edited

Loading