[Meta - RFC] The meaning of strings #70
Comments
Wait ... what? That's just an unintended regression then.
https://irclogs.nim-lang.org/11-10-2018.html#11:00:57
As @mratsim mentioned, I have a story to tell from developing https://github.com/Nim-NLP/finalseg.
This RFC fails to mention JS cstrings which are nothing like C cstrings, account for unicode by default, implicitly cast to The best way to solve this is to make I do not think there is a single solution for "what should be a binary blob". Except for non-
Prove me wrong, but Python strings are faster because string interning is built in and enabled by default, For JS speed string interpolation, but low priority for this. There are people who want a read-only getter for the Most string ops are basically
Context
In all programming languages, strings are one of the key types and also among those that cause the most controversies and performance issues. I think we need guidelines (NEP-3?) specifically dedicated to them.
Implementations:
There are two official kinds of strings in Nim:
- `string`, an implementation of Pascal strings. Pascal strings are a variable-length container + length. In Nim the implementation is equivalent to a `seq[char]` and we can actually cast between them.
- `cstring`, an implementation of C strings. C strings are pointers; the data is anything between the address pointed to and the first `'\0'` byte.

It's worth noting that unicode strings are built on top of the regular strings.
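A minimal sketch of how the two kinds differ in practice, assuming the default C backend and a Nim version that accepts the `std/` import prefix. The variable names are illustrative only.

```nim
import std/unicode

# "héllo" spelled with explicit UTF-8 bytes so the example does not depend on file encoding.
let s = "h\xC3\xA9llo"
doAssert s.len == 6        # string.len counts bytes and is O(1): the length is stored
doAssert s.runeLen == 5    # the unicode module decodes the same bytes into 5 code points

# A Nim string may contain a NUL byte; a cstring ends at the first one.
let raw = "ab" & '\0' & "cd"
let c = cstring(raw)       # explicit conversion; on the C backend this points at raw's buffer
doAssert raw.len == 5
doAssert c.len == 2        # cstring.len scans for '\0', so it is O(n)
```

The last two assertions are the reason a `string` and a `cstring` over the same buffer can disagree about its length.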
Performance:
Using strings naively has several performance costs. The first is the memory allocation for each string, especially for temporaries created when chaining functions or splitting a string.
C strings incur a linear cost every time we compute their length, and using them with the standard library might also require a copy when interfacing with a C library.
Controversies:
1. The first one is the matching between array[N, char]/seq[char]/openarray[char] and cstring or string. It was added in 0.12, removed in 0.17.2 or 0.18 and added back again in 0.19 (commit/PR missing). One of the use cases, from @Tim-St, is optimizing a huge number of reads with openarray[char]/ptr + len by avoiding copies to a Nim string.
2. Another controversy is the use of string to store binary blobs, as discussed in RFC 7337.
3. How to deal efficiently with unicode is also up in the air. I'm not experienced at all in Unicode, but I have a great deal of interest in it for parsing text in non-English languages for machine learning.
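To make the `openarray[char]`/ptr + len use case from the first controversy concrete, here is a rough sketch. It assumes a Nim version in which `string` is accepted where `openArray[char]` is expected (0.19 or later, per the history above) and the C backend for the pointer cast. `countSpaces` is a hypothetical helper, not a stdlib proc.

```nim
proc countSpaces(data: openArray[char]): int =
  ## Hypothetical helper: reads any contiguous run of chars without owning it.
  for ch in data:
    if ch == ' ':
      inc result

let line = "a b c d"
doAssert countSpaces(line) == 3                     # whole string, passed as openArray[char]
doAssert countSpaces(line.toOpenArray(2, 6)) == 2   # view of "b c d"; no substring is allocated
doAssert countSpaces(['x', ' ', 'y']) == 1          # array[3, char]
doAssert countSpaces(@['x', ' ']) == 1              # seq[char]

# ptr + len, as it might arrive from a C library (C backend assumed).
let p = cast[ptr UncheckedArray[char]](cstring(line))
doAssert countSpaces(p.toOpenArray(0, line.len - 1)) == 3
```

The same proc serves every container, which is why the matching rules between these types matter for avoiding copies.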
My opinion
I think point 1 can be solved by having the following proc:
similar to `toOpenArray`. This would allow interfacing with C libraries with a no-copy guarantee, and using the Nim stdlib on the result, without the implicit conversion that caused issues in #6350.
As explained in #7337, a string is semantically different from a memory blob. Nim's type system is rich enough, and we should use the `seq[byte]` type instead of repeating the mistakes from C. There is a `toOpenArrayByte` proc in system and ByteStream in streams to ease the transition.
Not sure, but a compiled language with a good unicode story and speed would have great appeal. Maybe @bung87 and the other contributors at Nim-NLP want to chime in?
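For the `seq[byte]` suggestion, a small sketch of what the transition could look like: the API takes `openArray[byte]`, new code passes a `seq[byte]`, and existing string data is viewed through `toOpenArrayByte` without a copy. `byteSum` is a hypothetical helper used only for illustration.

```nim
proc byteSum(data: openArray[byte]): int =
  ## Hypothetical helper: sums raw bytes, with no text semantics implied.
  for b in data:
    result += int(b)

# New code stores binary data as seq[byte].
let blob: seq[byte] = @[byte 1, 2, 3]
doAssert byteSum(blob) == 6

# Data still held in a string can be viewed as bytes without copying.
let legacy = "abc"
doAssert byteSum(legacy.toOpenArrayByte(0, legacy.high)) == 294   # 97 + 98 + 99
```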