Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add native UTF-8 Validation using fast shift based DFA #47880

Merged
merged 38 commits into from
Apr 12, 2023

Conversation

ndinsmore
Copy link
Contributor

@ndinsmore ndinsmore commented Dec 12, 2022

This is a based on the discussion in #41533. It is a julia implementation of a shift based DFA implemented with inspiration from golang/go#47120.

*** Edit the benchmarks have been updated using the code found in ndinsmore/StringBenchmarks***

Throughput improvement: small strings -> large strings

  • ASCII: 1.4x -> 20x
  • single nonASCII: 1.1x -> 10x
  • Unicode: 1.2x -> 2.4x

Master:

"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.873 ms)
	  "length 512:4096" => Trial(636.962 μs)
	  "length 64:512" => Trial(733.949 μs)
	  "length 4096:32768" => Trial(585.777 μs)
  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "files" => Trial(4.258 ms)
	  "lines" => Trial(7.950 ms)
	  "files SubString" => Trial(4.094 ms)
	  "lines SubString" => Trial(7.714 ms)
  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.991 ms)
	  "length 512:4096" => Trial(1.588 ms)
	  "length 64:512" => Trial(1.658 ms)
	  "length 4096:32768" => Trial(1.561 ms)
  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.123 ms)
	  "length 512:4096" => Trial(703.426 μs)
	  "length 64:512" => Trial(750.386 μs)
	  "length 4096:32768" => Trial(730.985 μs)

This PR:

  "isvalid" =>4-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.757 ms)
	    "length 512:4096" => Trial(164.487 μs)
	    "length 64:512" => Trial(659.963 μs)
	    "length 4096:32768" => Trial(62.674 μs)
    "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "files" => Trial(1.025 ms)
	    "lines" => Trial(5.355 ms)
	    "files SubString" => Trial(870.075 μs)
	    "lines SubString" => Trial(5.262 ms)
    "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.610 ms)
	    "length 512:4096" => Trial(671.044 μs)
	    "length 64:512" => Trial(840.627 μs)
	    "length 4096:32768" => Trial(654.662 μs)
    "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(794.437 μs)
	    "length 512:4096" => Trial(55.955 μs)
	    "length 64:512" => Trial(177.169 μs)
	    "length 4096:32768" => Trial(37.669 μs)

@ndinsmore
Copy link
Contributor Author

Right now this isn't working, working to resolve test failures

@ndinsmore
Copy link
Contributor Author

Just a further update, it looks like the state table from the reference I had is wrong. I am working on rebuilding it should be ready tomorrow.

@ndinsmore
Copy link
Contributor Author

This is now passing all test on my system I don't know why buildkite is failing, could someone relaunch the test for me?

@gbaraldi
Copy link
Member

Github is broken :)

@ndinsmore
Copy link
Contributor Author

Is the PR the right place to document the build process for the state machine table?

@oscardssmith
Copy link
Member

I'd put the table information in a comment in the code.

@ndinsmore
Copy link
Contributor Author

ndinsmore commented Dec 13, 2022

I think with the final documentation of the methodology it is ready to go.

base/strings/string.jl Outdated Show resolved Hide resolved
base/strings/string.jl Outdated Show resolved Hide resolved
@brenhinkeller brenhinkeller added strings "Strings!" performance Must go faster unicode Related to unicode characters and encodings labels Dec 18, 2022
base/strings/string.jl Outdated Show resolved Hide resolved
base/strings/string.jl Outdated Show resolved Hide resolved
base/strings/string.jl Outdated Show resolved Hide resolved
@ndinsmore
Copy link
Contributor Author

@stevengj & @StefanKarpinski (sorry to drag you into this, but I think you did a lot of the initial strings work)
Some of the requested changes(which are 100% correct) beg the question whether this should really be using The codeunits interface rather than unsafe_wrap.

Initial testing seemed to indicate that CodeUnits is slower, but I wonder if it would be wise to do it the correct way here. Then work on speeding up CodeUnits array.

@StefanKarpinski
Copy link
Member

Happy to be pulled in! It certainly seems cleaner to use codeunits. @JeffBezanson, any idea why that approach would be less efficient?

@ndinsmore
Copy link
Contributor Author

The implementation has changed to using CodeUnits instead of unsafe_wrap now. Once the GC.@preserve had been added to the unsafe_wrap calls, the benchmarks showed that CodeUnits was now faster and on par with the benchmarks before the GC changes were made.

@ndinsmore
Copy link
Contributor Author

@stevengj, @StefanKarpinski & @oscardssmith
There have been a good amount of changes since the initial PR, but I think it is ready to go.

I have reversed or order of operations on the shift dfa which leads to state that is never dirty (ie. you would have to state & UInt64(63) before use) and provided a performance improvement.

Also the use of codeunits has dramatically cleaned up the code and made the implementation more maintainable.

With this PR the processing rate is about 0.5 bytes / cycle for short strings (word to line length) and 1.5 bytes/cycle for longer strings ( file length). This is on par (short) to better (long) than the c implementation.

There are two PRs I plan to follow up with:

  1. Broader use of the DFA throughout strings.jl which should simplify the implementation of length,iterate, nextind, and basically anything else that parses through the utf-8 bytes.
  2. A heuristical to approach to short vs long strings. The current implementation can likely improve short strings 50%, but for long strings, it is fairly easy to get to 10 bytes/cycle for utf-8 and even better 50 bytes/cycle for ascii only strings.

Let me know if there is anything else that needs to be done to this.

@ndinsmore
Copy link
Contributor Author

Everything seems ready to go.

@StefanKarpinski StefanKarpinski added the needs tests Unit tests are required for this change label Jan 27, 2023
@StefanKarpinski
Copy link
Member

This is great work. Now that the implementation has moved to Julia, I think we need much more comprehensive tests of the functionality. Previously we could just assume that byte_string_classify was correct and simply verify that we use it correctly—basically try one ASCII, one non-ASCII UTF-8 and one invalid string. Now that the implementation is here, we need to have comprehensive tests for it, ideally ones that test all the edge cases of this algorithm, which you're in the best position at this point to understand.

@ndinsmore
Copy link
Contributor Author

Now that everything goes the the DFA would you agree that if the tests validate the DFA, we can assume byte_string_classify is validated?

@ndinsmore
Copy link
Contributor Author

Test have been added to validate that the DFA state machine returns state as expected and per the Unicode spec.

@ndinsmore ndinsmore force-pushed the native_utf8_validation branch from 935c46d to 8006d60 Compare April 6, 2023 13:40
@ndinsmore
Copy link
Contributor Author

@mkitti The rebase is done at this point, but thank you.

Does anyone know why these tests are failing?

@mkitti
Copy link
Contributor

mkitti commented Apr 6, 2023

I'm guessing the test failures is unrelated. Could be a problem on master or maybe CI.

@oscardssmith do we want to wait until master can pass tests, or is this mergeable now?

@oscardssmith
Copy link
Member

They look unrelated. I'm happy to merge as is if no one objects in the next day or two.

@ndinsmore
Copy link
Contributor Author

@mkitti & @oscardssmith this is a little off topic, but when I rebase I normally just merge master to my local master and rebase to that. I have had bad luck recently in picking masters that seem to fail tests. So the question is there a tagged master which is always passing all tests?

@oscardssmith
Copy link
Member

not really. in theory, master is supposed to pass tests, but in practice that sometimes doesn't happen.

@mkitti
Copy link
Contributor

mkitti commented Apr 6, 2023

Master does seem to be passing again... we should we rerun the failing test?

base/strings/string.jl Outdated Show resolved Hide resolved
Co-authored-by: Steven G. Johnson <[email protected]>
base/strings/string.jl Outdated Show resolved Hide resolved
@ndinsmore
Copy link
Contributor Author

Is the REPL failure related to this? I am sure it is text heavy but I don't see any of the changed functions in this PR returning a Union?

@oscardssmith
Copy link
Member

I'm slightly scared by this so I'm re-running CI.

@ndinsmore
Copy link
Contributor Author

Looks like it cleared

@oscardssmith oscardssmith merged commit b554e8f into JuliaLang:master Apr 12, 2023
@StefanKarpinski
Copy link
Member

Exciting times! Was considering how to review this, but it's well tested and I think we'll notice if string stuff is broken.

Xnartharax pushed a commit to Xnartharax/julia that referenced this pull request Apr 19, 2023
* Working Native UTF-8 Validation

---------

Co-authored-by: Oscar Smith <[email protected]>
Co-authored-by: Steven G. Johnson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants