Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add byte-array-to-string conversion to system module #14810

Open
snej opened this issue Jun 25, 2020 · 10 comments
Open

Add byte-array-to-string conversion to system module #14810

snej opened this issue Jun 25, 2020 · 10 comments

Comments

@snej
Copy link

snej commented Jun 25, 2020

Summary

The standard library could use a function to convert an openarray[char] or openarray[byte] to a string.

Description

Converting between byte arrays and strings is pretty common, in my experience. It's possible to cast from a seq[char] or seq[byte] to a string using cast[string](...), and the compiler allows you to do the same with an openarray, but at runtime you get a garbage string object. (It appears that the initial bytes of the array get reinterpreted as the string's length and address.) Here's an example.

This seems rather dangerous: an idiom that works in one case fails in another, but subtly enough that it might be overlooked at first. Results I've observed have been either a several-megabyte string of garbage, or an out-of-memory exception (when Nim tries to allocate a copy of a terabytes-long string.) I'm guessing that mutating the string might lead to memory corruption ... and it might be possible to craft a byte array that overwrites specific memory addresses, making this a potential security exploit.

I realize the cast operator is clearly documented as dangerous! The problem here isn't that using it causes undefined behavior, rather that (a) it's the only simple way to convert an array to a string, and worse, (b) it works for some types of arrays but not others.

Proposal

Add some toString(...) procs to the system module:

proc toString(chars: openarray[char]): string
proc toString(bytes: openarray[byte]): string

($ would be a better name, but that operator is already defined in dollars.nim, and has a different purpose.)

Alternatives

The only way I've found is to create an empty string and copy the bytes:

proc toString(bytes: openarray[byte]): string =
  result = newString(bytes.len)
  copyMem(result[0].addr, bytes[0].unsafeAddr, bytes.len)

It's a fine solution, but we shouldn't make developers use highly unsafe language features just to perform a common operation! Instead, that should be the implementation of the standard library function.

@timotheecour
Copy link
Member

I agree with the general idea (common cases like this should not require a cast for reasons you've already mentioned); but I'd do it a bit differently than what you're suggesting.

@Clonkk
Copy link
Contributor

Clonkk commented Oct 22, 2020

This would be really useful.

Ideally, it should be possible to get an immutable string from a openArray[byte] (and a way to get an immutable openArray[byte] from a string) without having to copyMem the datas.

@ringabout
Copy link
Member

ringabout commented Nov 4, 2020

I propose this solution in std or fusion:

func toByteSeq*(str: string): seq[byte] {.inline.} =
  ## Converts a string to the corresponding byte sequence.
  @(str.toOpenArrayByte(0, str.high))

func toString*(bytes: openArray[byte]): string {.inline.} =
  ## Converts a byte sequence to the corresponding string.
  let length = bytes.len
  if length > 0:
    result = newString(length)
    copyMem(result.cstring, bytes[0].unsafeAddr, length)

@Araq
Copy link
Member

Araq commented Nov 4, 2020

Your inline here is cute, it's not like that the calling overhead of X nano seconds makes a difference for a memory allocation plus copy operation... The real problem here is insufficient design, if you used string for binary data just stick with it, don't convert it to seq of bytes back and forth...

@ringabout
Copy link
Member

ringabout commented Nov 4, 2020

Yes, sometimes I need this.
For examples, nimcrypto only provides openArray[byte] interface, I need to convert this to string.

proc randomString*(size = DefaultEntropy): string {.inline.} =
  ## Generates a new system random strings.
  result = size.randomBytesSeq.fromByteSeq

And sometimes I need to debug openArray[byte] or for tests, I want to see the pretty format for openArray[byte] and I need stream interface.

    let 
      serialize = [0'u8, 0, 2, 0, 8, 1, 71, 174, 20, 1, 99, 99, 0]

    var str = fromByteSeq(serialize)
    let 
      strm = newStringStream(str)

    discard strm.readFrameHeaders

@Clonkk
Copy link
Contributor

Clonkk commented Nov 4, 2020

IMO, the functionality is needed.

When you use multiple external API, there are cases where you need one string and other where you need seq[byte].

My issue with the proposed solution :

  • It duplicates datas.
  • Having to allocate + copyMem every time is costly.
  • Ideally, there could very well be some APIs that store binary 16bit datas as seq[uint16] (not sure about this since casting seq between type of different sizeof is tricky).

Something like this (or whatever syntax is best) :

var buffer = newBuffer(bufsize)

with buffer as seq[uint16]:
  # ...
with buffer as seq[byte]:
  # ...
with buffer as string:
  # ...

@timotheecour
Copy link
Member

relevant but doesn't close this issue: #15951

@ZoomRmc
Copy link
Contributor

ZoomRmc commented Apr 16, 2022

I don't think this is a "good first issue" at all because it's absolutely unclear what will be accepted and where to actually put it, even though the solution is pretty obvious.

Can't we just stick

template whenJSorVM(isTrue, isFalse: untyped) = # TODO: replace with a macro
  when nimvm:
    isTrue
  else:
    when (defined(js) or defined(nimdoc) or defined(nimscript)):
      isTrue
    else:
      isFalse

func toString[T: char|byte](src: openArray[T]): string {.noinit.} =
  ## Returns a new string with contents copied from `src`.
  result = newString(src.len)
  whenJSorVM:
    for i in 0..high(src):
      result[i] = src[i]
  do:
    if src.len > 0:
      copyMem(result[0].addr, src[0].unsafeAddr, src.len)

anywhere and mark it as unstable API and not wait for various tangential RFC (such as nim-lang/RFCs#191)? This pattern is certainly very frequent in the wild so probably its existence is justified by now. This could be held under unstable API until we get string views or something else which is deemed "sufficiently designed".

@ZoomRmc
Copy link
Contributor

ZoomRmc commented May 12, 2022

Ok, so this definitely needs to be in the system.nim because it also has a substr which has to be tweaked to take openArray[char] as well and it begs to use toString then.

Code block in my previous comment updated.

@saemideluxe
Copy link
Contributor

I agree with @xflywind : Interopability between different APIs in the same codebase would be improved with this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants