[Meta - RFC] The meaning of strings #70

mratsim · 2018-10-11T11:51:17Z

Context

In all programming languages, strings are one of the key types and also one of those that cause the most controversy and performance issues. I think we need guidelines (NEP-3?) specifically dedicated to them.

Implementations:

There are 2 official kinds of strings in Nim:

string, an implementation of Pascal strings. Pascal strings are a variable-length container + length. In Nim the implementation is equivalent to a seq[char] and we can actually cast between them.

cstring, an implementation of C strings. C strings are pointers; the data is everything between the address pointed to and the first '\0' byte.

It's worth noting that unicode strings are built on top of the regular strings.
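To make the difference between the two kinds concrete, here is a minimal sketch, assuming the C backend and a UTF-8 encoded source file. The variable names are only for illustration.

```nim
import std/unicode

# A Nim string stores its length, so it may contain embedded NUL bytes.
let s = "hello" & '\0' & "world"
doAssert s.len == 11        # O(1): the length is stored with the data

# Viewed as a cstring, the data ends at the first '\0' byte.
let c = cstring(s)
doAssert c.len == 5         # O(n): scans for the terminator

# Unicode support sits on top of the byte string:
# len counts bytes, runeLen counts code points.
let u = "héllo"
doAssert u.len == 6
doAssert u.runeLen == 5
```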

Performance:

Using strings naively has several performance costs: the first one is the memory allocation for each string, especially temporaries created when chaining functions or splitting a string.

C-strings incur a linear cost every time we want to compute their length, and using them with the standard library might also require a copy when interfacing with a C library.
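As a rough illustration of the temporaries problem, the sketch below contrasts `strutils.split`, which allocates one new string per field, with an iterator that only yields index ranges into the original string. `fieldBounds` is a hypothetical helper written for this example, not a stdlib proc.

```nim
import std/strutils

iterator fieldBounds(s: string, sep: char): Slice[int] =
  ## Hypothetical helper: yields the index range of each field
  ## instead of copying the field into a new string.
  var start = 0
  for i in 0 .. s.len:
    if i == s.len or s[i] == sep:
      yield start .. i - 1
      start = i + 1

let line = "alpha,beta,gamma"

# Allocating version: every field becomes its own string.
let copies = line.split(',')

# Non-allocating version: only the bounds are produced.
var lens: seq[int]
for b in fieldBounds(line, ','):
  lens.add(b.len)

doAssert copies.len == 3
doAssert lens == @[5, 4, 5]
```

Whether this pays off depends on how the fields are consumed afterwards; any stdlib proc that requires a `string` still forces a copy.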

Controversies:

1. The first one is the matching between array[N, char]/seq[char]/openarray[char] and cstring or string. It was added in 0.12, removed in 0.17.2 or 0.18 and added back again in 0.19 (commit/PR missing).

   One of the use cases, by @Tim-St, is optimizing a huge number of reads with openarray[char]/ptr + len by avoiding copies to Nim strings.
2. Another controversy is the use of string to store binary blobs, as discussed in RFC #7337.
3. How to deal efficiently with unicode is also up in the air. I'm not experienced at all in Unicode, but I have a great deal of interest in it to parse text in non-English languages for machine learning.
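The openarray[char] use case from the first point can be sketched as follows. This assumes the `toOpenArray` overloads from `system` for strings and for `ptr UncheckedArray[T]`; `countSpaces` is a made-up consumer, and the string is passed through an explicit `toOpenArray` so the example does not depend on the implicit matching under discussion.

```nim
proc countSpaces(data: openArray[char]): int =
  ## Made-up consumer that works on any contiguous run of chars.
  for c in data:
    if c == ' ': inc result

var s = "a b c d"
let arr = ['x', ' ', 'y']

# A view over the whole string, without copying it.
let fromString = countSpaces(s.toOpenArray(0, s.high))

# Arrays match openArray[char] directly.
let fromArray = countSpaces(arr)

# ptr + len style: a view over bytes 2..4 ("b c") of the same buffer.
let p = cast[ptr UncheckedArray[char]](addr s[0])
let fromPtr = countSpaces(p.toOpenArray(2, 4))

doAssert fromString == 3
doAssert fromArray == 1
doAssert fromPtr == 1
```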

My opinion

I think point 1. can be solved by having the following proc:
```
proc toString(p: ptr char, len: cint): lent string =
  ...
```
Similar to toOpenArray, this would allow interfacing with C libraries with a no-copy guarantee and using the Nim stdlib on the result, without the implicit conversion that caused issues in #6350.
As explained in #7337, a string is semantically different from a memory blob. Nim's type system is rich enough, and we should use the seq[byte] type instead of repeating the mistakes of C. There is a toOpenArrayByte proc in system and ByteStream in streams to ease the transition.
Not sure, but a compiled language with a good unicode story and speed would have great appeal. Maybe @bung87 and the other contributors at Nim-NLP want to chime in?
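For contrast with the first two points, here is a small sketch of what is possible without the proposed proc. `toStringCopy` is a hypothetical copying counterpart of `toString` (the zero-copy `lent string` version above is a proposal, not an existing API), and `checksum` is a made-up blob consumer that uses `toOpenArrayByte` from `system`.

```nim
proc toStringCopy(p: ptr char, len: int): string =
  ## Hypothetical copying counterpart of the proposed `toString`:
  ## it always allocates, which is the cost the proposal wants to avoid.
  result = newString(len)
  if len > 0:
    copyMem(addr result[0], p, len)

proc checksum(data: openArray[byte]): int =
  ## Made-up blob consumer typed on bytes rather than on string.
  for b in data: result += int(b)

var buf = ['N', 'i', 'm', '!']
let copied = toStringCopy(addr buf[0], 3)

let text = "AB"
let blob: seq[byte] = @[1'u8, 2, 3]
let textSum = checksum(text.toOpenArrayByte(0, text.high))
let blobSum = checksum(blob)

doAssert copied == "Nim"
doAssert textSum == 65 + 66
doAssert blobSum == 6
```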


Araq · 2018-10-11T12:01:03Z

added back again in 0.19 (commit/PR missing).

Wait ... what? That's just an unintended regression then.

mratsim · 2018-10-11T12:08:22Z

https://irclogs.nim-lang.org/11-10-2018.html#11:00:57

mratsim: openarray[char] shouldn’t match string.
tim-st: @mratsim it does match string
tim-st: in 0.19.0
Araq: but it does and it was added after a feature request...
...
mratsim: but it was removed after a feature request after 0.17.2 no?
Araq: I don't know, we changed that array[char] is compatible with cstring

bung87 · 2018-10-11T12:58:24Z

As @mratsim mentioned, I have a story to tell from developing https://github.com/Nim-NLP/finalseg.

Regex split with nre and re is about 4 times slower than the Python version when the regex contains a unicode script; you can find the details in "regex split performance", Nim-NLP/finalseg#1.
Not sure it is relevant, but seq assignment is very slow.
$seq[Rune] is slow, so I keep track of string offsets (a start offset and an end offset) instead of Runes.
Table clear is slow.
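The offset-tracking idea can be sketched like this, assuming `runeLenAt` from `std/unicode` and a UTF-8 encoded source file. `runeBounds` is a hypothetical helper for illustration and is not taken from finalseg.

```nim
import std/unicode

proc runeBounds(s: string): seq[Slice[int]] =
  ## Hypothetical helper: the byte range of every rune in `s`,
  ## so callers can slice the original string instead of
  ## building and stringifying a seq[Rune].
  var i = 0
  while i < s.len:
    let n = s.runeLenAt(i)
    result.add(i .. i + n - 1)
    i += n

let word = "héllo"
let bounds = runeBounds(word)

doAssert bounds.len == 5
doAssert word[bounds[1]] == "é"   # bytes 1..2 hold the two-byte rune
```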

metagn · 2020-04-28T10:18:14Z

This RFC fails to mention JS cstrings, which are nothing like C cstrings: they account for unicode by default, implicitly cast to pointer but not to ptr char (nim-lang/Nim#14097; this is not hard to fix), have no bindings from Nim for their instance methods (e.g. String.prototype.trim), and also have no support in strutils without a conversion to string.

The best way to solve this is to make string a seq[char], seq[T] and cstring concrete pure types (except in JS where you either define a new type JsString or specialize cstring to import it); then write procs like the ones in strutils for concepts like Indexable[char]/Traversable[char]/Iterable[char]/Enumerable[char] (whatever the difference is when they're implemented) a la #50.
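A rough sketch of the "write strutils-like procs once for anything indexable by char" idea is shown below. It swaps in a plain type class for the concepts mentioned above, assumes the C backend for cstring indexing, and uses hypothetical names (`CharSeqLike`, `countChar`).

```nim
type
  # Hypothetical type class standing in for a concept
  # such as Indexable[char].
  CharSeqLike = string | seq[char] | cstring

proc countChar(s: CharSeqLike, c: char): int =
  ## One implementation, instantiated per concrete type.
  for i in 0 ..< s.len:
    if s[i] == c: inc result

let inString = countChar("banana", 'a')
let inSeq = countChar(@['a', 'b', 'a'], 'a')
let inCstring = countChar(cstring("banana"), 'n')

doAssert inString == 3
doAssert inSeq == 2
doAssert inCstring == 2
```

A real concept would also admit user-defined types, which a closed type class like this one does not.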

I do not think there is a single solution for "what should be a binary blob", except for non-var types, where a first class openarray fits; it can just be a VLA without a capacity field (and in JS, the Blob type). There's also the problem of char vs byte, which probably has no solution other than another distinct type that makes everyone lose.

juancarlospaco · 2021-01-12T18:14:00Z

Prove me wrong, but Python strings are faster because string interning is built in and enabled by default.
String interning is kind of like a transparent cached/memoized string.
A string interning module should be added to the stdlib; it should not take more than 1 file, at least for the C target only.
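For readers unfamiliar with the technique, a minimal interning pool could look like the sketch below. It is only an illustration of the idea, assuming `std/tables`; `Interner` and `intern` are hypothetical names, and nothing here reflects how Python implements interning.

```nim
import std/tables

type
  Interner = object
    ids: Table[string, int]   # string -> handle
    strs: seq[string]         # handle -> string

proc intern(pool: var Interner, s: string): int =
  ## Returns a small integer handle; equal strings share one handle,
  ## so later comparisons are integer comparisons.
  result = pool.ids.getOrDefault(s, -1)
  if result < 0:
    result = pool.strs.len
    pool.strs.add(s)
    pool.ids[s] = result

var pool: Interner
let a = pool.intern("token")
let b = pool.intern("tok" & "en")
let c = pool.intern("other")

doAssert a == b
doAssert a != c
doAssert pool.strs[a] == "token"
```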

For JS speed string interpolation, but low priority for this.

There are people who want a read-only getter for the cap of a string for FFI purposes.

Most string ops are basically for loops working on a char;
this needs macro-based loop unrolling, like C/C++/Pascal have had for decades.

narimiran transferred this issue from nim-lang/Nim Jan 2, 2019

mratsim mentioned this issue May 19, 2019

Can't pass openArray[char] to string parameter nim-lang/Nim#11277

Closed

mratsim mentioned this issue Jan 12, 2021

The base64 module should deal in openArray[byte] instead of strings nim-lang/Nim#16688

Open
