Non-canonical CBOR Serialization (Optional) #356
We already offer exact round-trip encoding with
This is exactly why we overhauled
I'm a little confused about what the underscore here is. Why is it added, and what does that have to do with canonical vs non-canonical CBOR? That only impacts:
Maybe I'm just not understanding what the underscore is meant to be. What tools are they using instead, and how do they handle it? I know CSL just disregards any encoding details from the deserialization and serializes using an arbitrary format (mostly canonical with a couple changes). |
Is this about when you're creating something from scratch (not deserialized)? Generally that's not great for hashing and you should always be working with direct deserialization or raw bytes instead of hoping two tools line up when creating both separately. I think that's what all the safe hash/original bytes stuff in cardano node is for. You can change the |
AFAIK the underscore is an optional indicator: https://datatracker.ietf.org/doc/html/rfc7049#section-6.1 The original lucid, which uses an internal fork of CML, works with the non-canonical format and outputs CBOR like this |
Ideally it would be great to have something like |
It says those are for the diagnostic notation, not for the CBOR bytes themselves, if I'm understanding this correctly. The diagnostic notation is a human-readable way to write out what the underlying bytes are, since in the notation something like
Are you ending up with different bytes? Could you post them along with how you got them (e.g. CML -> other tool or other tool -> CML)? |
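To make the distinction between diagnostic notation and bytes concrete, here is a minimal sketch that hand-assembles two CBOR encodings of the same value, Constr(0, [h'deadbeef']), following the CBOR specification. The bytes are built by hand for illustration and are not captured output from lucid-cardano, lucid-evolution, or CML; the helper `toHex` is a hypothetical name introduced here.

```typescript
// Hand-assembled CBOR for the value Constr(0, [h'deadbeef']).
// Assumption: constructor 0 is represented as CBOR tag 121 wrapping an array of fields.

// Hypothetical helper: render a byte array as lowercase hex.
const toHex = (bytes: number[]): string =>
  bytes.map((b) => ("0" + b.toString(16)).slice(-2)).join("");

const tag121 = [0xd8, 0x79];                   // tag(121), argument in one following byte
const field = [0x44, 0xde, 0xad, 0xbe, 0xef];  // byte string of length 4

// Definite-length array: header 0x81 means "array of exactly 1 item".
const definite = [...tag121, 0x81, ...field];

// Indefinite-length array: header 0x9f, then the items, then the 0xff "break" byte.
const indefinite = [...tag121, 0x9f, ...field, 0xff];

console.log(toHex(definite));   // d8798144deadbeef    diagnostic: 121([h'DEADBEEF'])
console.log(toHex(indefinite)); // d8799f44deadbeefff  diagnostic: 121([_ h'DEADBEEF'])
```

The underscore only appears in the diagnostic rendering of the second form. In the bytes themselves, the difference is the 0x9f header and the trailing 0xff.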
I think there must be some confusion here. There is a combinatorial number of ways to produce non-canonical encodings. Something complex like a whole transaction could have millions of them (maybe billions when you consider the strings/bytes in it; the indefinite string/bytes serialization options can get pretty wild). The default |
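As a rough illustration of how quickly the options multiply, the sketch below enumerates the well-formed header encodings for each part of Constr(0, [h'deadbeef']) and multiplies them. It deliberately counts only header widths and definite vs indefinite arrays, and it ignores indefinite (chunked) byte strings, so the result is a lower bound rather than the full count. `hexPad` and `headerEncodings` are hypothetical helpers written for this example.

```typescript
// Left-pad a hex rendering of v to the given width (hypothetical helper).
function hexPad(v: number, width: number): string {
  let s = v.toString(16);
  while (s.length < width) s = "0" + s;
  return s;
}

// Every well-formed CBOR header (as hex) for a major type and argument n.
// Non-minimal widths are valid CBOR, just not canonical.
function headerEncodings(major: number, n: number): string[] {
  const out: string[] = [];
  const first = (info: number): string => hexPad((major << 5) | info, 2);
  if (n < 24) out.push(first(n));                           // argument inside the initial byte
  if (n < 0x100) out.push(first(24) + hexPad(n, 2));        // 1-byte argument
  if (n < 0x10000) out.push(first(25) + hexPad(n, 4));      // 2-byte argument
  if (n < 0x100000000) out.push(first(26) + hexPad(n, 8));  // 4-byte argument
  out.push(first(27) + hexPad(n, 16));                      // 8-byte argument
  return out;
}

const tagWays = headerEncodings(6, 121);                 // tag 121
const arrayWays = headerEncodings(4, 1).concat(["9f"]);  // definite widths plus indefinite
const bytesWays = headerEncodings(2, 4);                 // definite 4-byte string only

const total = tagWays.length * arrayWays.length * bytesWays.length;
console.log(tagWays.length, arrayWays.length, bytesWays.length, total); // 4 6 5 120
```

Even under these restrictions a one-field datum has 120 distinct valid byte sequences; allowing chunked byte strings and other variations raises the count much further.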
This is the output from lucid-cardano, using the internal CML fork:
import * as LucidCardano from "lucid-cardano";
console.log(LucidCardano.Data.to(new LucidCardano.Constr(0, ["deadbeef"])));
This is the output from lucid-evolution, using dcspark CML:
import * as LucidEvolution from "@lucid-evolution/lucid";
console.log(LucidEvolution.Data.to(new LucidEvolution.Constr(0, ["deadbeef"])));
You can paste the above into https://cbor.nemo157.com to get the diagnostic notation |
You should not be creating the two separately in two different tools and expecting the hash to line up, as there are myriad ways the two could differ for any non-trivial CBOR structure. Hashing CBOR data should always be done on the original bytes. If you need to use CML you can create it in lucid, then use
try:
import * as LucidCardano from "lucid-cardano";
let datum1 = LucidCardano.Data.to(new LucidCardano.Constr(0, ["deadbeef"]));
let bytes = datum1.to_cbor_bytes(); // not sure what it's called in lucid
import * as LucidEvolution from "@lucid-evolution/lucid";
console.log(LucidCardano.Data.from_cbor_bytes(bytes));
This looks exactly like a case for what I mentioned earlier with
I can implement this for plutus datums if it is a common issue. It can also be resolved with better general developer understanding of CBOR/hashing, but I guess it can't hurt to have it too. I put a CBOR section in the new docs (with 6.0.0) on this, but to be honest CBOR + hashing is a bit of a mess. Everything would be so much simpler if IOHK enforced canonical-only. The way it works is that the cardano node defaults to definite for empty and indefinite for plutus datums. In CSL/old CML we had this hard-coded to imitate that, since we didn't have proper round-tripping like we do now. |
tl;dr: as long as you are creating it using
Relying on a specific CBOR encoding without that being documented is not good, and is precisely why we spent a lot of time ensuring that cddl-codegen/CML remember every little encoding option. Just on that simple datum there are |
I also don't like having different standards; that's the main reason we decided to build lucid-evolution on CML, which follows the right one. But the dev community is having issues transitioning to our library just because of the serialization discrepancy. It'd be nice if you could enable that option. By the way, we don't want to use lucid-cardano for this data transformation |
There is no "right one" - that's the problem. You have to handle ANY CBOR variation. You saw how there are over 50000 ways to encode a simple datum of a 4 byte field. The only time there is a "right way" to encode CBOR is when the protocol you're working with explicitly sets one (usually this is canonical CBOR, which is why it was invented). If that hasn't happened then all bets are off - any of those 50000+ encodings is just as "right" as any other one. There is no such specified encoding for the cardano protocol, there are just some tools that happen to match in some spots (likely just datums too, I'd bet that Lucid doesn't match the node in most other spots). How the cardano node serializes CBOR can and has changed before and is entirely an implementation detail. You can submit a tx/datum to the network using ANY weird encoding and the cardano-node will accept it, that is the reason why the node itself has to remember the original bytes of everything everywhere so hashes still work. When working with datums from another tool or on-chain you simply cannot try re-creating it using the construction API - you need to be working on the bytes created from the other tool or on-chain.
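A minimal sketch of why hashes must be computed over the original bytes: the two hex strings below are hand-assembled definite and indefinite encodings of the same Constr(0, [h'deadbeef']) value, and they hash differently. SHA-256 is used only as a stand-in because it ships with Node; Cardano datum hashes use Blake2b-256, so the digests here are not real datum hashes. `hashHex` is a hypothetical helper.

```typescript
import { createHash } from "crypto";

// Stand-in hash (assumption: SHA-256 instead of Cardano's Blake2b-256).
const hashHex = (cborHex: string): string =>
  createHash("sha256").update(Buffer.from(cborHex, "hex")).digest("hex");

// Two valid encodings of the same logical value, assembled by hand.
const definiteHex = "d8798144deadbeef";     // 121([h'DEADBEEF'])
const indefiniteHex = "d8799f44deadbeefff"; // 121([_ h'DEADBEEF'])

// Re-creating the value in another tool may yield either encoding,
// so the digests are not guaranteed to agree.
const recreatedMatches = hashHex(definiteHex) === hashHex(indefiniteHex);

// Hashing the bytes you actually received always reproduces the same digest.
const originalMatches = hashHex(indefiniteHex) === hashHex(indefiniteHex);

console.log(recreatedMatches); // false
console.log(originalMatches);  // true
```

The same reasoning applies whichever hash function is used: equal data does not imply equal bytes, and only equal bytes give equal hashes.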
That said, as mentioned earlier, if you need a specific format for
What exactly is the data transformation that needs to be done? I assume it's not on the datums, as changing them in any way would change the hash, so you should only care about the hash/CBOR format of datums that are already finalized and ready for the chain. |
Returns the same datum but with the encoding details changed so that it will match cardano-node/CSL/Lucid.
This is only for cases where dealing in raw bytes and/or `from_cbor_bytes()` is somehow not possible, as that approach is foolproof for calculating hashes with any tool.
Fixes #356
I hope I didn't rant too much earlier; it's just a really, really bad idea to rely on any tools agreeing 100% on all possible CBOR encodings. Always work with raw bytes (and things in CML deserialized from those bytes) whenever possible to avoid sneaky hash mismatches with some specific datums. The hashes might match for most datums when you're testing, but that doesn't mean they will always match. That would require both tool makers to seriously try to match each other, never change, and never accidentally have one small difference that causes the hash to mismatch in rare cases even though it is usually the same. I just put up PR #357. Does this work for you? |
yes this is perfect, thank you! |
Tools like Aiken and Plutarch add an extra underscore `_` to items during CBOR serialization. This inconsistency becomes problematic when serialized data is hashed, as the resulting hash differs between tools.
In Aiken, the below serialization will produce a CBOR value like `[ _ <hashed_value>]`, whereas CML's serialization would produce `[ <hashed_value>]` without the underscore.
Example:
Even though CML follows RFC 7049 section 3.9, the issue has resulted in some developers discarding CML in favour of other tools that maintain the non-canonical serialization.
CML should offer an option to choose between canonical and non-canonical serialization to allow compatibility across different tools.