proposal: unicode/utf8: rune count in a valid UTF-8 string #57896
Comments
Can you point to examples of places where this function would be used?
What should the function return if the string is not a valid UTF-8 string after all? There's no particular reason that this function has to be in the standard library. Would it make sense to make it available as a third-party library and see if it gets adoption? https://go.dev/doc/faq#x_in_std
If the string is not valid UTF-8, the function would return garbage. The use case I have in mind is a system that accepts possibly broken input but validates it early, so only valid input is passed down and the system's components receive trusted, valid strings. This is the approach we also used in the simdutf library: the API contains fully validating converters, but there are also faster counterparts that assume valid input.
I think it's better for more optimised functions (e.g. assuming valid UTF-8, ASCII characters only, mostly ASCII, mostly non-ASCII, ...) to be exposed in specialised libraries. Otherwise we end up with lots of different functions in the utf8 package (possibly without names that make their differing assumptions clear), all performance-optimised for some case but also easily misused. It would be better if the library itself conveyed what it is optimised for. Such a library, similar to simdutf8, does not need to be in the standard library. That said, if the existing utf8 functions can be made faster for common cases without making them more "unsafe", or more complex than the performance gain justifies, that is a possibility.
In general we work very hard to ensure that Go functions do not "return garbage". That is the C/C++ way, not the Go way. My rant about where that path leads is at https://research.swtch.com/plmm#ub.
Thanks for the comments. After rethinking this, I see that the core library is not the proper place for such specialised procedures. My only excuse was the existing ToValidUTF8. One thing I strongly disagree with is the claim that SWAR techniques are unsafe or have anything to do with the memory model. It's about unusual access to a well-defined data structure. A UTF-8 string is just a sequence of bytes, but it can be viewed as a sequence of uint64. BTW, Go allows modifying individual bytes of a string, thus users can do anything and produce garbage sequences.
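To make the "sequence of uint64" view concrete, here is a minimal SWAR sketch that counts runes in input assumed to be valid UTF-8. It is not the code from the linked Sneller commit: the function name `countRunesSWAR` is hypothetical, and it reads words through `encoding/binary` rather than through `unsafe`. The idea is that a continuation byte has the form 10xxxxxx, so the rune count equals the byte length minus the number of continuation bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"unicode/utf8"
)

// countRunesSWAR counts runes in b, ASSUMING b is valid UTF-8.
// On invalid input the result is meaningless.
func countRunesSWAR(b []byte) int {
	const hi = 0x8080808080808080 // bit 7 of every byte in a 64-bit word

	cont := 0 // number of continuation bytes (10xxxxxx) seen so far
	i := 0
	for ; i+8 <= len(b); i += 8 {
		x := binary.LittleEndian.Uint64(b[i:])
		// x<<1 moves bit 6 of each byte into bit 7 of the same byte, so
		// x &^ (x<<1) keeps bit 7 only where bit 7 is set and bit 6 is clear.
		cont += bits.OnesCount64(x &^ (x << 1) & hi)
	}
	// Handle the remaining 0..7 bytes one at a time.
	for ; i < len(b); i++ {
		if b[i]&0xC0 == 0x80 {
			cont++
		}
	}
	return len(b) - cont
}

func main() {
	s := "héllo, wörld, 日本語"
	fmt.Println(countRunesSWAR([]byte(s)), utf8.RuneCountInString(s))
}
```

Whether this is actually faster than `utf8.RuneCount` depends on the compiler and the input, so it would need benchmarking before drawing conclusions.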
This proposal has been declined as retracted.
I'd like to propose a function to return the number of runes in a valid UTF-8 string. Such a function can be a few times faster than utf8.RuneCount; please check our results: SnellerInc/sneller@9ee35af. There are use cases when we are sure that the input is valid. Also, the Go standard library already provides ToValidUTF8 (https://pkg.go.dev/strings#ToValidUTF8).
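As a rough illustration of what such a function might look like in its simplest scalar form, the sketch below counts every byte that is not a continuation byte. The name `countRunesValid` is hypothetical and this is not the implementation from the linked commit. The `main` function also prints the results for an invalid input, where the sketch and `utf8.RuneCountInString` (which treats each erroneous byte as one rune of width 1) disagree.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunesValid counts runes in s, ASSUMING s is valid UTF-8.
// Every rune starts with exactly one non-continuation byte, so counting
// the bytes that do not match 10xxxxxx gives the number of runes.
func countRunesValid(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i]&0xC0 != 0x80 {
			n++
		}
	}
	return n
}

func main() {
	valid := "日本語"
	fmt.Println(countRunesValid(valid), utf8.RuneCountInString(valid))

	// Two stray continuation bytes: not valid UTF-8, so the assumption is
	// violated and the two functions are not expected to agree.
	invalid := "\x80\x80"
	fmt.Println(countRunesValid(invalid), utf8.RuneCountInString(invalid))
}
```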