Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode/utf8: potential performance improvement of UTF8 validation #69915

Open
romshark opened this issue Oct 16, 2024 · 8 comments
Open

unicode/utf8: potential performance improvement of UTF8 validation #69915

romshark opened this issue Oct 16, 2024 · 8 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@romshark
Copy link

Proposal Details

I've been experimenting with different kinds of optimization techniques in order to speed up utf8.Valid and utf8.ValidString and I'd like to share my findings that appear to be low-hanging fruit, at least on ARM64:

Screenshot 2024-10-16 at 23 42 49

The change is minimal but yields performance improvements of up to 41% on Apple M1 Max for all-UTF8 inputs while not showing any significant negative side-effects. Though on AMD Ryzen 7 5700x the geomean is significantly lower at just -1.43%, with some benchmarks even showing significant performance degradation ⚠️.

The test setup with benchmarks can be found here: https://github.com/romshark/utf8valid

Relation to Other Proposals

It might be related to #47120 but I haven't had the opportunity to dig into the details of Russ' proposal. At first glance it looks like Russ tackled exactly that problem and my proposal may just be a duplicate.

Compared to #63347 this proposal doesn't require changes of significant complexity, it's a 9-line code change.

@gopherbot gopherbot added this to the Proposal milestone Oct 16, 2024
@gabyhelp
Copy link

@romshark romshark changed the title proposal: Improve performance of utf8.Valid and utf8.ValidString on ARM64 observation: Potential performance improvement of UTF8 validation Oct 16, 2024
@ianlancetaylor
Copy link
Member

I don't see any API change here, so taking this out of the proposal process.

@ianlancetaylor ianlancetaylor changed the title observation: Potential performance improvement of UTF8 validation unicode/utf8: potential performance improvement of UTF8 validation Oct 17, 2024
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Oct 17, 2024
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog Oct 17, 2024
@randall77
Copy link
Contributor

You've gotten performance improvements but I don't see any description of the changes you made to get those performance improvements (unless the changes are all in that screenshot?).

It would be nice to post a CL, together with some Benchmarks that demonstrate the improvements (if those benchmarks are not already present in the utf8 package).

@adonovan
Copy link
Member

adonovan commented Oct 18, 2024

@randall77 See the linked repo, which contains the patched code and the benchmark results. The diff is hard to see (it's not even in the history of the repo), but it really is just this:

@@ -473,16 +473,22 @@ func Valid(p []byte) bool {
                        return false // Short or invalid.
                }
                accept := acceptRanges[x>>4]
+               // The change is here. Instead of reusing `size` constants 2, 3 and 4
+               // are used which leads to performance improvement of up to 41%.
                if c := p[i+1]; c < accept.lo || accept.hi < c {
                        return false
                } else if size == 2 {
+                       i += 2
                } else if c := p[i+2]; c < locb || hicb < c {
                        return false
                } else if size == 3 {
+                       i += 3
                } else if c := p[i+3]; c < locb || hicb < c {
                        return false
+               } else {
+                       i += 4
                }
-               i += size
+
        }
        return true
 }
@@ -519,16 +525,21 @@ func ValidString(s string) bool {
                        return false // Short or invalid.
                }
                accept := acceptRanges[x>>4]
+               // The change is here. Instead of reusing `size` constants 2, 3 and 4
+               // are used which leads to performance improvement of up to 41%.
                if c := s[i+1]; c < accept.lo || accept.hi < c {
                        return false
                } else if size == 2 {
+                       i += 2
                } else if c := s[i+2]; c < locb || hicb < c {
                        return false
                } else if size == 3 {
+                       i += 3
                } else if c := s[i+3]; c < locb || hicb < c {
                        return false
+               } else {
+                       i += 4
                }
-               i += size
        }
        return true
 }

@ldemailly
Copy link

ldemailly commented Oct 19, 2024

I guess the speedup comes from https://cs.opensource.google/go/go/+/refs/tags/go1.23.2:src/unicode/utf8/utf8.go;l=508

		size := int(x & 7)

and the compiler can't know that 5,6,7 are never supposed to happen (vs hardcoding += 4)
(thanks to how first[] is constructed)

nor 1 because of

		if si < RuneSelf {

@ldemailly
Copy link

I think you should make a CL but also add comments to explain why it works / why +4 is the only possible case in the else

also I wonder if you could benchmark that vs a switch (which would be more readable and equivalent I think)

@AlexanderYastrebov
Copy link
Contributor

AlexanderYastrebov commented Oct 20, 2024

It might be related to #47120 but I haven't had the opportunity to dig into the details of Russ' proposal. At first glance it looks like Russ tackled exactly that problem and my proposal may just be a duplicate.

DFA implementation from #47120 looks much simpler than existing one.
It would be interesting to see how this change benchmark results compare to DFA results.

@randall77
Copy link
Contributor

I'm not sure I understand how that change makes things any faster, but if there are benchmarks that demonstrate it I'm ok with it.

An aside, feel free to ignore...

size is already needed in a register for the conditional, so it isn't clear to me why += const is any faster than += register. Both are identical speedwise on all chips I've seen. Maybe because we have to merge from each case back to a single += instruction before jumping to the top of the loop, instead of jumping to the top of the loop directly? It might be worth trying += size in each if block instead of after, just to see what happens. Not that the final code should do that, but just to tease out why there is a speed improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants