Add native UTF-8 Validation using fast shift based DFA #47880
Right now this isn't working; I am working to resolve the test failures. |
Just a further update: it looks like the state table from the reference I had is wrong. I am rebuilding it, and it should be ready tomorrow. |
This is now passing all tests on my system. I don't know why Buildkite is failing; could someone relaunch the tests for me? |
Github is broken :) |
Is the PR the right place to document the build process for the state machine table? |
I'd put the table information in a comment in the code. |
I think with the final documentation of the methodology it is ready to go. |
@stevengj & @StefanKarpinski (sorry to drag you into this, but I think you did a lot of the initial strings work) Initial testing seemed to indicate that |
Happy to be pulled in! It certainly seems cleaner to use |
The implementation has changed to using |
@stevengj, @StefanKarpinski & @oscardssmith I have reversed the order of operations on the shift DFA, which leads to a state that is never dirty (ie. you would have to Also the use of With this PR the processing rate is about 0.5 bytes/cycle for short strings (word to line length) and 1.5 bytes/cycle for longer strings (file length). This is on par with the C implementation for short strings and better than it for long strings. There are two PRs I plan to follow up with:
Let me know if there is anything else that needs to be done to this. |
Everything seems ready to go. |
This is great work. Now that the implementation has moved to Julia, I think we need much more comprehensive tests of the functionality. Previously we could just assume that |
Now that everything goes through the DFA, would you agree that if the tests validate the DFA, we can assume |
Tests have been added to validate that the DFA state machine returns state as expected and per the Unicode spec. |
935c46d to 8006d60
@mkitti The rebase is done at this point, but thank you. Does anyone know why these tests are failing? |
I'm guessing the test failures are unrelated. Could be a problem on master or maybe CI. @oscardssmith do we want to wait until master can pass tests, or is this mergeable now? |
They look unrelated. I'm happy to merge as is if no one objects in the next day or two. |
@mkitti & @oscardssmith this is a little off topic, but when I rebase I normally just merge master into my local master and rebase onto that. I have had bad luck recently picking masters that fail tests. So the question is: is there a tagged master that always passes all tests? |
Not really. In theory, master is supposed to pass tests, but in practice that sometimes doesn't happen. |
Master does seem to be passing again... should we rerun the failing test? |
Co-authored-by: Steven G. Johnson <[email protected]>
Is the REPL failure related to this? I am sure it is text heavy but I don't see any of the changed functions in this PR returning a Union? |
I'm slightly scared by this so I'm re-running CI. |
Looks like it cleared |
Exciting times! Was considering how to review this, but it's well tested and I think we'll notice if string stuff is broken. |
* Working Native UTF-8 Validation --------- Co-authored-by: Oscar Smith <[email protected]> Co-authored-by: Steven G. Johnson <[email protected]>
This is based on the discussion in #41533. It is a Julia implementation of a shift-based DFA, written with inspiration from golang/go#47120.
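For readers new to the technique, the following is a minimal sketch of a shift-based DFA for UTF-8 validation. It is not the code from this PR. The state names, the 6-bit packing, and the helpers `next_state`, `build_row`, `dfa_step`, and `isvalid_utf8_sketch` are all assumptions made for illustration. The idea is that each state is stored as a bit offset into a 64-bit table row, so one transition is a table load, a shift, and a mask.

```julia
# Hypothetical sketch: each state is a bit offset (a multiple of 6) into a UInt64 row.
const ACCEPT = UInt64(0)   # between characters (also the start state)
const ERR    = UInt64(6)   # invalid input seen; absorbing
const CS1    = UInt64(12)  # expect 1 more continuation byte
const CS2    = UInt64(18)  # expect 2 more continuation bytes
const CS3    = UInt64(24)  # expect 3 more continuation bytes
const E0     = UInt64(30)  # after 0xE0: next byte must be 0xA0..0xBF
const ED     = UInt64(36)  # after 0xED: next byte must be 0x80..0x9F
const F0     = UInt64(42)  # after 0xF0: next byte must be 0x90..0xBF
const F4     = UInt64(48)  # after 0xF4: next byte must be 0x80..0x8F
const STATES = (ACCEPT, ERR, CS1, CS2, CS3, E0, ED, F0, F4)

# Plain reference transition function, following the well-formed
# UTF-8 byte sequence ranges of the Unicode Standard.
function next_state(state::UInt64, b::UInt8)
    if state == ACCEPT
        b <= 0x7f && return ACCEPT
        0xc2 <= b <= 0xdf && return CS1
        b == 0xe0 && return E0
        b == 0xed && return ED
        0xe1 <= b <= 0xef && return CS2
        b == 0xf0 && return F0
        0xf1 <= b <= 0xf3 && return CS3
        b == 0xf4 && return F4
        return ERR
    elseif state == CS1
        return 0x80 <= b <= 0xbf ? ACCEPT : ERR
    elseif state == CS2
        return 0x80 <= b <= 0xbf ? CS1 : ERR
    elseif state == CS3
        return 0x80 <= b <= 0xbf ? CS2 : ERR
    elseif state == E0
        return 0xa0 <= b <= 0xbf ? CS1 : ERR
    elseif state == ED
        return 0x80 <= b <= 0x9f ? CS1 : ERR
    elseif state == F0
        return 0x90 <= b <= 0xbf ? CS2 : ERR
    elseif state == F4
        return 0x80 <= b <= 0x8f ? CS2 : ERR
    end
    return ERR  # ERR never recovers
end

# Pack the next state for every current state into one 64-bit row per byte.
function build_row(b::UInt8)
    row = UInt64(0)
    for s in STATES
        row |= next_state(s, b) << s
    end
    return row
end

const TABLE = [build_row(UInt8(b)) for b in 0:255]

# One transition: load the row for this byte, shift by the current state, mask 6 bits.
dfa_step(state::UInt64, b::UInt8) = (TABLE[Int(b) + 1] >> state) & 0x3f

function isvalid_utf8_sketch(bytes::AbstractVector{UInt8})
    state = ACCEPT
    for b in bytes
        state = dfa_step(state, b)
    end
    return state == ACCEPT  # must end on a character boundary
end

isvalid_utf8_sketch(codeunits("héllo"))       # true
isvalid_utf8_sketch(UInt8[0xc0, 0x80])        # false (overlong encoding)
isvalid_utf8_sketch(UInt8[0xed, 0xa0, 0x80])  # false (surrogate code point)
```

Nine states at 6 bits each use 54 of the 64 bits in a row. The inner loop has no branches that depend on the data, which is what makes this style of DFA fast.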
***Edit: the benchmarks have been updated using the code found in ndinsmore/StringBenchmarks***
Throughput improvement: small strings -> large strings
Master:
This PR: