[Meta - RFC] The meaning of strings #70

Open
mratsim opened this issue Oct 11, 2018 · 5 comments

@mratsim
Collaborator

mratsim commented Oct 11, 2018

Context

In all programming languages, strings are one of the key types and also one of those that cause the most controversy and performance issues. I think we need guidelines (NEP-3?) specifically dedicated to them.

Implementations:

There are 2 official kinds of strings in Nim:

string, an implementation of Pascal strings. Pascal strings are a variable-length container + length. In Nim the implementation is equivalent to a seq[char] and we can actually cast between them.

cstring, an implementation of C strings. C strings are pointers; the data is everything between the address pointed to and the first '\0' byte.

It's worth noting that unicode strings are built on top of the regular strings.
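
As a small illustration of the three points above (a sketch, not a statement about implementation internals): a Nim string knows its own length and can hold a '\0' byte, the cstring view of the same data stops at the first '\0', and the unicode module counts code points on top of the same bytes. The variable names are only for illustration.

```nim
import std/unicode

# A Nim string stores its length, so an embedded NUL byte is ordinary data.
let s = "ab\x00cd"
let nimLen = s.len            # stored length, no scan needed

# The cstring view of the same buffer ends at the first '\0',
# and its length is found by scanning for that byte.
let c = cstring(s)
let cLen = c.len

# Unicode sits on top of the byte string:
# len counts bytes, runeLen counts code points.
let word = "h\u00E9llo"       # "héllo" encoded as UTF-8
let byteLen = word.len
let runeCount = word.runeLen

echo nimLen, " ", cLen, " ", byteLen, " ", runeCount  # prints: 5 2 6 5
```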

Performance:

Using strings naively has several performance costs: the first one is the memory allocation for each string, especially for temporaries created when chaining functions or splitting a string.

C-strings incur a linear cost every time we want to compute their length. Using them with the standard library might also require a copy when interfacing with a C library.
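
One way to picture the cost of temporaries when splitting: instead of building a new string per field, a caller can walk the original string and work with offsets. The sketch below is only an illustration; `fieldBounds` and `longestField` are hypothetical helpers, not stdlib procs.

```nim
# Hypothetical helper: yields the inclusive byte range of each field
# instead of allocating a new string per field.
iterator fieldBounds(s: string, sep: char): tuple[first, last: int] =
  var first = 0
  for i in 0 .. s.len:
    # The end of the string closes the last field.
    if i == s.len or s[i] == sep:
      yield (first, i - 1)
      first = i + 1

proc longestField(s: string, sep: char): int =
  ## Length of the longest field, computed from offsets only.
  for first, last in fieldBounds(s, sep):
    result = max(result, last - first + 1)

echo longestField("a,bb,,ccc", ',')  # prints: 3
```

An empty field shows up as a range whose `last` is one less than its `first`, so its length is 0.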

Controversies:

  1. The first one is the matching between array[N, char]/seq[char]/openarray[char] and cstring or strings. It was added in 0.12, removed in 0.17.2 or 0.18 and added back again in 0.19 (commit/PR missing).

    One of the use cases by @Tim-St is optimizing a huge number of reads with openarray[char]/ptr + len by avoiding copies to Nim string.

  2. Another controversy is the use of string to store binary blobs, as discussed in RFC #7337.

  3. How to deal efficiently with unicode is also up in the air. I'm not experienced at all in Unicode, but I have a great deal of interest in it to parse text in non-English languages for machine learning.
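
The openarray[char]/ptr + len use case from point 1 can be sketched roughly as follows, assuming the `toOpenArray` overloads from system. `countChar` is a hypothetical proc; the point is only that one signature can serve a string slice and a raw buffer without first building a Nim string.

```nim
proc countChar(data: openArray[char], c: char): int =
  ## Works on any contiguous run of chars, whatever owns the memory.
  for ch in data:
    if ch == c: inc result

# A slice of a Nim string, passed as a view (no new string is built).
let line = "key=value;k2=v2"
let inSlice = countChar(line.toOpenArray(0, 8), '=')

# A raw pointer plus a length, as a C library might hand back.
var buf = ['a', '=', 'b', '=']
let p = cast[ptr UncheckedArray[char]](addr buf[0])
let inBuf = countChar(p.toOpenArray(0, buf.len - 1), '=')

echo inSlice, " ", inBuf  # prints: 1 2
```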

My opinion

  1. I think point 1. can be solved by having the following proc:

    proc toString(p: ptr char, len: cint): lent string =
      ...

    Similar to toOpenArray, this would allow interfacing with a C lib with a guarantee of no copy, and using the Nim stdlib on the result without the implicit conversion that caused issues in #6350.

  2. As explained in #7337, a string is semantically different from a memory blob. Nim's type system is rich enough, and we should use the seq[byte] type instead of repeating the mistakes from C. There is a toOpenArrayByte proc in system and ByteStream in streams to ease the transition.

  3. Not sure, but a compiled language with a good unicode story and speed would have great appeal. Maybe @bung87 and the other contributors at Nim-NLP want to chime in?
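
For points 1 and 2, a rough sketch under stated assumptions: `toStringCopy` is a hypothetical copying conversion, shown only for contrast with the no-copy `toString` proposed above (whose body is left open in this RFC), and `xorAll` is a hypothetical proc that takes bytes and accepts both a seq[byte] blob and, through `toOpenArrayByte`, data still stored in a string.

```nim
proc toStringCopy(p: ptr char, len: int): string =
  ## Copying conversion: allocates a Nim string and copies len bytes.
  ## This is the copy the proposed toString would avoid.
  result = newString(len)
  if len > 0:
    copyMem(addr result[0], p, len)

proc xorAll(data: openArray[byte]): byte =
  ## Toy checksum over a binary blob.
  for b in data:
    result = result xor b

var raw = ['N', 'i', 'm']
let copied = toStringCopy(addr raw[0], raw.len)

# A blob typed as bytes from the start.
let blob: seq[byte] = @[1'u8, 2'u8, 4'u8]
let fromBlob = xorAll(blob)

# The same bytes stored in a string, viewed as bytes without a copy.
let legacy = "\x01\x02\x04"
let fromString = xorAll(legacy.toOpenArrayByte(0, legacy.high))

echo copied, " ", fromBlob, " ", fromString  # prints: Nim 7 7
```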

@Araq
Member

Araq commented Oct 11, 2018

added back again in 0.19 (commit/PR missing).

Wait ... what? That's just an unintended regression then.

@mratsim
Collaborator Author

mratsim commented Oct 11, 2018

https://irclogs.nim-lang.org/11-10-2018.html#11:00:57

mratsim: openarray[char] shouldn’t match string.
tim-st: @mratsim it does match string
tim-st: in 0.19.0
Araq: but it does and it was added after a feature request...
...
mratsim: but it was removed after a feature request after 0.17.2 no?
Araq: I don't know, we changed that array[char] is compatible with cstring

@bung87

bung87 commented Oct 11, 2018

As @mratsim mentioned, I have a story to tell from developing https://github.com/Nim-NLP/finalseg.

  1. Regex split with nre and re is about 4 times slower than the Python version when the regex contains Unicode scripts. You can find the details in "regex split performance", Nim-NLP/finalseg#1.
  2. Not sure if this is relevant, but seq assignment is very slow.
  3. $seq[Rune] is slow, so I keep track of string offsets (a start offset and an end offset) instead of Runes.
  4. Table clear is slow.
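
The offset-tracking approach from item 3 might look roughly like this. It is a sketch, not the finalseg code: `runeOffsets` is a hypothetical iterator built on `runeLenAt` from the unicode module, and it yields byte ranges instead of building a seq[Rune].

```nim
import std/unicode

# Hypothetical helper: yields the inclusive byte range of each rune,
# so callers can refer to characters by offset without decoding to Runes.
iterator runeOffsets(s: string): tuple[first, last: int] =
  var i = 0
  while i < s.len:
    let n = s.runeLenAt(i)    # number of bytes of the rune starting at i
    yield (i, i + n - 1)
    i += n

let text = "a\u00E9\u4E2D"    # 1-byte, 2-byte and 3-byte UTF-8 sequences
var spans: seq[(int, int)]
for first, last in runeOffsets(text):
  spans.add((first, last))

echo spans  # prints: @[(0, 0), (1, 2), (3, 5)]
```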

@metagn
Contributor

metagn commented Apr 28, 2020

This RFC fails to mention JS cstrings, which are nothing like C cstrings: they account for unicode by default, implicitly cast to pointer but not ptr char (nim-lang/Nim#14097, this is not hard to fix), have no instance method bindings from Nim (e.g. String.prototype.trim), and have no support in strutils without a conversion to string.

The best way to solve this is to make string a seq[char], seq[T] and cstring concrete pure types (except in JS where you either define a new type JsString or specialize cstring to import it); then write procs like the ones in strutils for concepts like Indexable[char]/Traversable[char]/Iterable[char]/Enumerable[char] (whatever the difference is when they're implemented) a la #50.

I do not think there is a single solution for "what should be a binary blob". Except for non-var types, then first class openarray fits, which can just be a VLA without a capacity field (and in JS, the Blob type). There's the problem of char vs byte which probably has no solution, except another distinct type that makes everyone lose.

@juancarlospaco
Contributor

juancarlospaco commented Jan 12, 2021

Prove me wrong, but Python strings are faster because string interning is built in and enabled by default.
String interning is kind of like a transparent cached/memoized string.
A string interning module should be added to the stdlib; it should not take more than 1 file, at least for the C target only.
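
To make the idea concrete, here is a minimal sketch of what a string interning table could look like. It is an assumption for illustration only, not an existing stdlib module: `Interner`, `intern` and `lookup` are hypothetical names, and the sketch says nothing about how Python implements interning or how much speed it would buy in Nim.

```nim
import std/tables

type
  Interner = object
    ids: Table[string, int]   # string -> handle
    strs: seq[string]         # handle -> string

proc intern(it: var Interner, s: string): int =
  ## Returns a small integer handle; equal strings get the same handle,
  ## so later comparisons can be done on integers.
  result = it.ids.getOrDefault(s, -1)
  if result < 0:
    result = it.strs.len
    it.strs.add s
    it.ids[s] = result

proc lookup(it: Interner, id: int): string =
  ## Maps a handle back to its string.
  it.strs[id]

var pool = Interner(ids: initTable[string, int]())
let a = pool.intern("token")
let b = pool.intern("tok" & "en")   # equal content, built separately
let c = pool.intern("other")

echo a == b, " ", a == c, " ", pool.lookup(a)  # prints: true false token
```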

For JS speed string interpolation, but low priority for this.

There are people who want a read-only getter for the cap of a string for FFI purposes.

Most string ops are basically for loops working on a char;
this needs macro-based loop unrolling, like C/C++/Pascal have had for decades.
