proposal: unicode/utf8: rune count in a valid UTF-8 string #57896

wojciech-sneller · 2023-01-18T08:22:56Z

I'd like to propose a function that returns the number of runes in a valid UTF-8 string. Such a function can be a few times faster than utf8.RuneCount; please check our results: SnellerInc/sneller@9ee35af.
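For readers unfamiliar with the idea, here is a minimal scalar sketch of what such a function could look like. It is not the code from the linked commit; the name `runeCountValid` is hypothetical. It relies on the fact that in valid UTF-8 every rune contributes exactly one byte that is not a continuation byte (`10xxxxxx`), so counting those bytes gives the rune count without decoding.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCountValid is a hypothetical name, not an existing unicode/utf8 function.
// It assumes s is valid UTF-8 and counts every byte that is not a
// continuation byte (bit pattern 10xxxxxx). For invalid input the result
// does not, in general, match utf8.RuneCountInString.
func runeCountValid(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i]&0xC0 != 0x80 {
			n++
		}
	}
	return n
}

func main() {
	s := "zażółć gęślą jaźń"
	// Both counts agree because s is valid UTF-8.
	fmt.Println(runeCountValid(s), utf8.RuneCountInString(s))
}
```

The loop has no data-dependent branching on rune width, which is what makes it a good candidate for word-at-a-time or SIMD processing.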

There are use cases where we are sure that the input is valid. Also, the Go standard library already provides ToValidUTF8 (https://pkg.go.dev/strings#ToValidUTF8).


seankhliao · 2023-01-18T09:22:45Z

can you point to examples of places where this function would be used?

ianlancetaylor · 2023-01-18T17:31:48Z

What should the function return if the string is not a valid UTF-8 string after all?

There's no particular reason that this function has to be in the standard library. Would it make sense to make it available as a third-party library and see if it gets adoption? https://go.dev/doc/faq#x_in_std

wojciech-sneller · 2023-01-19T19:00:06Z

If the string is not valid UTF-8, the function would return garbage. The use case I have in mind is a system that accepts possibly broken input but validates it early, so that only valid input is passed down; the system's components receive trusted, valid strings.

This is the approach we also used in the simdutf library: the API contains fully validating converters, but there are also faster counterparts that assume valid input.
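One way to express the "validate early, pass only trusted strings down" pattern in Go is a wrapper type whose only constructor validates. The sketch below is an assumption about how such a system might be organised, not code from simdutf or Sneller; `ValidString`, `Validate`, and `ErrInvalidUTF8` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// ErrInvalidUTF8 is a hypothetical sentinel error for rejected input.
var ErrInvalidUTF8 = errors.New("invalid UTF-8")

// ValidString is a hypothetical wrapper. Code outside this package can only
// obtain a non-empty value through Validate, so methods may rely on the
// invariant that s is valid UTF-8. The zero value holds "", which is valid.
type ValidString struct{ s string }

// Validate checks the input once, at the system boundary.
func Validate(s string) (ValidString, error) {
	if !utf8.ValidString(s) {
		return ValidString{}, ErrInvalidUTF8
	}
	return ValidString{s: s}, nil
}

// RuneCount counts non-continuation bytes; correct because of the invariant.
func (v ValidString) RuneCount() int {
	n := 0
	for i := 0; i < len(v.s); i++ {
		if v.s[i]&0xC0 != 0x80 {
			n++
		}
	}
	return n
}

func main() {
	if v, err := Validate("héllo"); err == nil {
		fmt.Println(v.RuneCount())
	}
	if _, err := Validate("\xff"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

With this shape, the fast path never sees unvalidated input, so the question of what to return for invalid strings does not arise for callers outside the package.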

martisch · 2023-01-22T07:12:06Z

I think it's better for more optimised functions (e.g. assuming valid utf8, ascii characters only, mostly ascii, mostly non ascii, ...) to be exposed in specialised libraries. Otherwise we end up with lots of different functions in the utf8 package (maybe without names that clearly convey their differing assumptions), all performance-optimized for some case but also easily misused.

I think it would be better if the library itself conveyed what it's optimized for. It does not seem necessary for such a library, similar to simdutf8, to be in the standard library.

That said, if the existing utf8 functions can be made faster for common cases without making them more "unsafe" or disproportionately complex relative to the performance gain, that is a possibility.

rsc · 2023-03-15T21:14:34Z

In general we work very hard to ensure that Go functions do not "return garbage". That is the C/C++ way, not the Go way. My rant about where that path leads is at https://research.swtch.com/plmm#ub.

wojciech-sneller · 2023-03-16T00:10:45Z

Thanks for the comments. After rethinking this, I see that the core library is not the proper place for such specialised procedures. My only excuse was the existing ToValidUTF8.

One thing I strongly disagree with is the claim that SWAR techniques are unsafe or have anything to do with the memory model. It's about unusual access to a well-defined data structure. A UTF-8 string is just a sequence of bytes, but it can be viewed as a sequence of uint64. BTW, Go allows modifying individual bytes of a string, thus users can do anything and produce garbage sequences.
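To make the SWAR (SIMD within a register) idea concrete, here is a hedged sketch that processes eight bytes per step. It is not the implementation from the linked commit; `runeCountSWAR` is a hypothetical name, and the input is assumed to be valid UTF-8. Words are loaded with `encoding/binary`, so the `unsafe` package is not needed. A continuation byte has bit 7 set and bit 6 clear, which `w &^ (w << 1)` detects in the top bit of each byte lane.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"unicode/utf8"
)

// runeCountSWAR is a hypothetical name. It assumes b is valid UTF-8 and
// returns len(b) minus the number of continuation bytes (10xxxxxx).
func runeCountSWAR(b []byte) int {
	const hi = 0x8080808080808080 // top bit of each of the 8 byte lanes
	cont := 0
	i := 0
	for ; i+8 <= len(b); i += 8 {
		w := binary.LittleEndian.Uint64(b[i:])
		// w<<1 moves bit 6 of each byte into bit 7 of the same byte, so
		// w &^ (w<<1) keeps bit 7 only where bit 7 is 1 and bit 6 is 0.
		cont += bits.OnesCount64(w &^ (w << 1) & hi)
	}
	// Handle the remaining 0..7 bytes one at a time.
	for ; i < len(b); i++ {
		if b[i]&0xC0 == 0x80 {
			cont++
		}
	}
	return len(b) - cont
}

func main() {
	b := []byte("zażółć gęślą jaźń")
	// Both counts agree because b is valid UTF-8.
	fmt.Println(runeCountSWAR(b), utf8.RuneCount(b))
}
```

For invalid input this function and `utf8.RuneCount` can disagree, because `utf8.RuneCount` counts each invalid byte as one rune of width 1, while the sketch only looks at bit patterns.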

rsc · 2023-04-06T15:36:48Z

This proposal has been declined as retracted.
— rsc for the proposal review group

wojciech-sneller added the Proposal label Jan 18, 2023

gopherbot added this to the Proposal milestone Jan 18, 2023

seankhliao changed the title ~~proposal: unicode/utf8 - rune count in a valid UTF-8 string~~ proposal: unicode/utf8:- rune count in a valid UTF-8 string Jan 18, 2023

seankhliao changed the title ~~proposal: unicode/utf8:- rune count in a valid UTF-8 string~~ proposal: unicode/utf8: rune count in a valid UTF-8 string Jan 18, 2023

ianlancetaylor added this to Proposals Jan 18, 2023

ianlancetaylor moved this to Incoming in Proposals Jan 18, 2023

wojciech-sneller closed this as completed Mar 16, 2023

ianlancetaylor removed this from Proposals Mar 16, 2023

rsc added this to Proposals Apr 6, 2023

rsc moved this to Incoming in Proposals Apr 6, 2023

rsc moved this from Incoming to Declined in Proposals Apr 6, 2023

rsc removed this from Proposals Mar 15, 2024

golang locked and limited conversation to collaborators Apr 5, 2024

gopherbot added the FrozenDueToAge label Apr 5, 2024
