-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Standardize API of all crypto/hashes packages #32
Comments
Mentionned by @arnetheduck , C++17 introduced a distinct type called byte
|
Yes! This has been annoying me for quite some while now. Another thing that should be taken into consideration is reading of data from a stream and writing to one. Whatever solution this comes to, the corresponding base type should be added to the streams module as well so as to facilitate easy input -> en-/decrypt -> output. Personally I agree with the rationale offered, and while the DSP argument is valid Nim already has things like |
Another consideration for the API is streaming - for this it's customary to have a state that goes through init, update (many times) and finalization.. there can then be convenience helpers that call these three functions for non-streaming cases |
I would vote for:
|
What's the rationale for having a distinct As far as representing binary data in Nim is concerned, I agree with @zielmicha, |
this byte vs uint8 vs char or string/seq problem is not exclusively belongs to crypto/hashes library. Graphics library also suffer from the same problem. How to consistently represent pixels data for graphics API. let's say we already reach consensus here and adopt a common standard. but the problem is still there, the output from other library, e.g. output from graphics library we feed into crypto/hashes library still need go through some sort of conversion. this additional conversion step might not be acceptable under some circumstances. therefore I suggest we could use Nim provided features such as generics or template to create a flexible API that capable of accepting various inputs and produce various outputs. I know this involved quite a lot of work and required more elaborate test, but at least it can prevent future war on types. |
@dom96 and @zielmicha, why reuse the string type for something that's not intuitively a string? Nim has a nice powerful type system, so no need to represent two different things with the same type. Differentiating them means that we could even do nice things like having a $ for byte data that outputs hexadecimal, and we can avoid passing a string to something that takes a byte array and vice-versa. I would say we should go the opposite direction and specify a |
@PMunch Then you can't do The underlying problem is that there is no definite distinction what is a binary string and what is a text string. Python 3 has the same problem. Note how it needs separate stream types for text and binary IO. |
I agree with PMunch, it's not C, we shouldn't use @jangko not ideal but casting works until everyone migrates to the same standard Note that @zah started a library just to deal with this issue when receiving data from external sources: https://github.com/status-im/nim-ranges. proc hash123(s: openarray[byte]): array[N, byte] =
...
proc hash123(s: string): array[N, byte] =
...
proc hash123_fromHex(s: string): array[N, byte] =
... Also maybe it would be worth it to introduce a |
@zielmicha basically what @mratsim said, but you could also have a converter from
Couldn't agree more, and this is what we now have the possibility to add to Nim. By having a bytes type we can easily make the distinction clear, both to the programmer, and to the reader. |
Just last night, I was thinking of writing a Nim API to abstract the C interop in nimssl which is a wrapper for Openssl. Right now, only the SHA and AES portions are wrapped. Based on what is agreed here, accordingly the API can be adjusted. The underlying implementation returns an array of chars which I accordingly cast and then convert to a hex string. |
Mind adding https://nimble.directory/pkg/libsodium https://github.com/federicoceratto/nim-libsodium to the list? |
Here is my proposal. type
A_Digest* = distinct array[X, byte]
A_Context* = object ## accumulator state
# fields here
proc init*(a: var A_Context) = ...
proc update*(a: var A_Context; bytes: openArray[byte]) = ...
proc finish*(a: var A_Context): A_Digest = ...
proc toHex*(a: A_Digest; upperCase: bool = false): string = ... This is the minimal protocol a hashing algorithm should support. The API introduces no overhead as far as I can see and it supports "streaming". On top of that either via a template hashBytes*(algorithm: typedesc; bytes: openArray[byte]; upperCase = false): string =
var context: algorithm
init(context)
update(context, bytes)
toHex(finish(context), upperCase) And let's also have a template toBytes*(x: string): seq[byte] =
cast[seq[byte]](x) Thankfully |
If we are going to define a minimal set of types/concepts in the standard lib, I think we should define a single type
MDigest*[bits: static[int]] = object
data*: array[bits div 8, uint8] Currently, we have such a type in Nimcrypto: Regarding the use of My long-term vision is that this |
Please note that Nimcrypto already has a similar function to the https://github.com/cheatfate/nimcrypto/blob/master/hash.nim#L17 The difference is that it returns a fixed-size Digest struct, which can be then converted to string if the user needs a string representation. |
The only reason why |
We definitely need a way to construct an openarray from the underlying (pointer, length) pair. Probably yet another builtin for system.nim. It's true that my proposal depends on this missing piece. |
Well that is easily done in my proposal too but it seemed rarely useful. In the end the different algorithms return different object types so if I need some common result representation it's often a |
People want strings on their screen, but our code strongly prefers the tightly-packed bytes of the Digest structure (for storing them in memory and databases, for putting them in binary-encoded network messages and so on). |
We also need a way to get sub-slices of the openarrays and a way to keep them around in memory in async code (as @cheatfate mentioned). In our project, I've currently also defined a Here is the But clearly, this problem is separate from the interface issue discussed here. With "manual" openarray construction it will be possible to adapt any range implementation to the proposed interface. |
My sysrandom library also suffers from this kind of problem. Currently it can write the random bytes into a pointer, array of |
Would a type
MDigest*[bits: static[int]] = object
# Message digest result of any hash function
data*: array[bits div 8, uint8]
type HexString = distinct string
proc `$`[bits: static[int]](d: MDigest[bits]): HexString
const hexChars = "0123456789abcdef"
result = newString(2*N div 8)
for i in 0 ..< N:
result[2*i] = hexChars[int ba[i] shr 4 and 0xF]
result[2*i+1] = hexChars[int ba[i] and 0xF]
proc readHexChar(c: char): byte {.noSideEffect, inline.}=
## Converts an hex char to a byte
case c
of '0'..'9': result = byte(ord(c) - ord('0'))
of 'a'..'f': result = byte(ord(c) - ord('a') + 10)
of 'A'..'F': result = byte(ord(c) - ord('A') + 10)
else:
raise newException(ValueError, $c & "is not a hexademical character")
proc hexToMDigest*[bits: static[int]](hexStr: string): MDigest[bits] {.noSideEffect, noInit.}=
## Read an hex string and store it in a message Digest
var i = 0
if hexStr[i] == '0' and (hexStr[i+1] == 'x' or hexStr[i+1] == 'X'):
inc(i, 2) # Ignore 0x and 0X prefix
assert hexStr.len - i == 2 * (N div 8)
while i < N:
result[i] = hexStr[2*i].readHexChar shl 4 or hexStr[2*i+1].readHexChar
inc(i)
proc hexToSeqByte*(hexStr: HexString): seq[byte] {.noSideEffect.}=
## Read an hex string and store it in a sequence of bytes in Big-Endian order
var i = 0
if hexStr[i] == '0' and (hexStr[i+1] == 'x' or hexStr[i+1] == 'X'):
inc(i, 2) # Ignore 0x and 0X prefix
let N = (hexStr.len - i) div 2
result = newSeq[byte](N)
while i < N:
result[i] = hexStr[2*i].readHexChar shl 4 or hexStr[2*i+1].readHexChar
inc(i)
proc toBytes(s: string): seq[byte] {.inline.}=
cast[seq[byte]](s)
|
Actually I really like the (stack) |
@FedericoCeratto, @genotrance, @cheatfate I've updated the original post with libsodium (input/output: string), nimssl (input/output: ptr cuchar + len) and nimcrypto (input: ptr uint8 + len) (further overloads coming to nimcrypto) |
corresponding C++ proposal for span / memrange: http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0122r1.pdf the if doing crypto functions instead (en/decrypt), it might make sense to have caller manage memory so as not to make decisions for them - that would mean an api that takes an openarray-like input and a c++-span-like output (or streams, though that can usually be built as a layer on top). the same kind of api would be good for compression too, and a wide range of other use cases as others have noted. |
There is some discussion about views/memranges here: #5753 It was settled that openarray[T] will have this role (#5957). |
Tagging @krux02 as he already requested C++ span/view/memrange/slice/openarray here nim-lang/Nim#5437 In the wild:
|
@mratsim Thanks for the invite. I will now also add my two cents. I just started reading this thread, and yes I agree, a consistent hashing interface is important. I like the interface that Araq suggested. It looks a lot like the old hashing interface, but without the ugly hashing operator definitions. In my opinion the hashing operators are a bad example for operator overloading/definition. I just don't agree with the suggested convenience template. It is called The only case that I would like to throw into the pot to think about is to not make runtime branching on hashing algorithms easy. But with the proposed interface it should work nicely to implement something like this: proc hash(hashKind: HashKind; bytes: openarray[byte]; dst: var openarray[byte]): string =
case hashKind
of ... |
Looking back at some of the older messages, in reference to @Araq's reply, I'm incredibly surprised that this is valid Nim code: template toBytes*(x: string): seq[byte] =
cast[seq[byte]](x) Is it really valid? Are sequences and strings really that similar? My reasoning for preferring I was under the impression that converting this data into a |
@krux02 will your proc hash[H: HashKind](bytes: openarray[byte]): HashKind =
when H is: ... @dom96, byte = char (from C standard: a char and a byte are the smallest addressable unit of memory) and in 99% of the case char = uint8 (except the DSP i mentionned). Edit: In the past Also for C compatibility we have |
I already said it in the chat, but I will repeat it here as well. When you abuse a string as binary data. Be aware that you are in fact abusing the sting type. Never ever expose this abuse in the public API. And to make sure you never expose this abuse in the public API, just also don't use it internally. And to prefere string for C compatibility, is wrong in several layers. First of all, when you warp a C library, provide a nice Nim wrapper. Please don't expose c types in the Nim API. If you do so you force everybody who uses that library to put the conversion at the usage side. And that is just not very nice. And then, string is not any less or more compatible with C than any other type in Nim. |
the name |
As a first step we released a byteutils package to deal with byte <-> hex conversion to and from seq[byte] and openarray[byte]. Also regarding @arnetheduck remark, we settled so far on the MDigest (Message Digest) type for crypto hashes defined in nimcrypto here. Basically all hashes provide the following interface:
type
MDigest*[bits: static[int]] = object
data*: array[bits div 8, byte]
bchar* = byte | char
proc `$`*(digest: MDigest): string =
result = ""
var i = 0'u
while i < uint(len(digest.data)):
result &= hexChar(cast[byte](digest.data[i]))
inc(i)
proc digest*(HashType: typedesc, data: ptr byte,
ulen: uint): MDigest[HashType.bits] =
mixin init, update, finish, clear
var ctx: HashType
ctx.init()
ctx.update(data, ulen)
result = ctx.finish()
ctx.clear()
proc digest*[T](HashType: typedesc, data: openarray[T],
ostart: int = 0, ofinish: int = -1): MDigest[HashType.bits] =
mixin init, update, finish, clear
var ctx: HashType
let so = if ostart < 0: (len(data) + ostart) else: ostart
let eo = if ofinish < 0: (len(data) + ofinish) else: ofinish
let length = (eo - so + 1) * sizeof(T)
ctx.init()
if length <= 0:
result = ctx.finish()
else:
ctx.update(cast[ptr byte](unsafeAddr data[so]), uint(length))
result = ctx.finish()
ctx.clear() In the HashType.bits bits is a static[int] as shown below. The definition of a HashType, for example sha3/keccak: type
KeccakKind* = enum
Sha3, Keccak, Shake
KeccakContext*[bits: static[int],
kind: static[KeccakKind]] = object
q: array[25, uint64]
pt: int |
For the byte format it would be nice to have a corresponding stream in the |
Yes, a corresponding stream type would be really useful.
… On 3 Apr 2018, at 08:18, PMunch ***@***.***> wrote:
For the byte format it would be nice to have a corresponding stream in the streams module as well. So you would be able to easily read data in from, and write data out to such a stream.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
I created a |
If I may add my two cents, a hash function should support a stack |
I've had a look into the crypto packages in stdlib (md5 and sha1). I don't see away to gradually make the change with a {.deprecated.} pragma. The way forward would be to deprecate those packages altogether and either redirect to nimble packages or implement a revamped crypto stdlib with the desired API. |
fwiw, https://github.com/status-im/nim-stew is the new place for experimental API's that we produce (byteutils moved in there) |
A cryptographer and security auditor just made an article about the security issues caused by using strings instead of byte arrays: https://littlemaninmyhead.wordpress.com/2021/09/15/if-you-copied-any-of-these-popular-stackoverflow-encryption-code-snippets-then-you-did-it-wrong/ |
As we're working with crypto at Status, we regularly get bitten by the different types in Nim libraries to represent binary data.
Context:
Crypto and hashing libraries in the wild uses arrays or seq of uint8, char, byte, object, uint32, cuchar and also strings to represent the binary hashes. I have noted the following issues:
A. Output hash type is inconsistent:
B. 0.18 introduction of
$
for arrays broke libraries:C. Input type is inconsistent:
its dependency memRange package
D.
$
Conversion of the hash type is inconsistent between string, hex or binary representation:$
on array[N, char] (N = 28, 32, 48, 64 only) returns a binary stringbin2hex
for the hex representation.$
for array of bytes (returns a string of uint8) and requiretoHex
for the hex representation$
(it's a pointer). It requirestoHex
for the hex representation.Note that we are currently having this discussion internally, but since Araq mentionned his pain working with hashes following 0.18, I believe the whole Nim ecosystem would benefit.
Stakes
Nimbus (A client for Ethereum 2.0) will bring a lot of attention to Nim crypto ecosystem and it's in our best interest to adopt healthy common standards.
Beyond the internal representations, I think a common set of APIs is what matters
Questions and considerations for API:
1. "String" input
proc hash1234(s: string): Hash = ...
, what shoulds
be, a hex string or binary data?2. Variable size input
Which one(s) among
proc hash1234(s: seq[uint8]): Hash = ...
proc hash1234(s: seq[byte]): Hash = ...
proc hash1234(s: seq[char]): Hash = ...
proc hash1234(s: string): Hash = ...
3.
$
behaviourWhich one among:
4. Chaining
hash1234(hash1234(foo))
should be allowed without casting/conversion5. Hex output
Which one among:
$
toHex
Opinions
For consistency one type between byte/uint8/char should be used.
At the same time one type between seq[byte]/seq[uint8]/seq[char] and string should be used.
Note: Those are only my opinions and does not represent the opinion of all Status devs
Opinion 1:
$
is very error prone right now. From an ergonomic point of view, it makes sense to convert binary blob/hashes to their hex representation.Opinion 2:
$
converting to the string repr of seq[uint8] can be avoided by usingdistinct
types or wrapping Hash = array[N, byte|uint8] in an object.Opinion 3: The internal representation should be byte.
Rationale:
A counterpoint can be made that byte (or char) hides the actual size of the type.
However Windows and POSIX require 1 byte = 8-bit. The only systems with 1 byte != uint8 are DSP chips (they use 1 byte = 12, 16, 24 bit), see here.
Opinion 4: Let's nail it! 🎆
Inviting: @jangko, @cheatfate, @Araq, @zah, @yglukhov to the discussion.
The text was updated successfully, but these errors were encountered: