-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Size limit of identity hash #130
Comments
64bytes is most definitely a non-starter, as it already won't fit the prefix/varinting of a 512bit hash. More generally: any data that you know won't ever be repeated is a good candidate for inlining. Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ). It would make me sad if we put an arbitrary small limit bound by the ability to print out a CID. Also note that other visible system parts, e.g. dirnames have no limits at present. Last but not least: there are currently production systems that do inline ~2k or was it ~4k of data on the dht today ( /cc @ianopolous ) |
That's only a current IPFS limitation. It's not a general limitation if you look at Multihash in isolation. To me identity hashing makes sense to save space if the hashing output it bigger than the input. If you use it for something else, then to me it has the smell of some hack around some limitations that could/should be solved on a better way. |
There are already limits in the go code. We've been using it to inline stuff in Peergos <= 4096+16 bytes (after discussing with core ipfs devs, but I can't find the issue right now @Stebalien, @whyrusleeping ? ). There are earlier discussions on this: cbor-gen also seems to have just added a limit of 100 for this: |
@vmx 2MiB-1 is a libp2p limit - it permeates everything ( but you are correct: multihash in isolation is limited by uvarint alone ) It is generally correct that inlining is there to realize savings, but they come in more forms than just space:
TLDR: We definitely should define a formal limit. I definitely think this limit needs to be in the high kilo byte range, way beyond the proposed 64 bytes. |
I care more about having an actual limit, than what the limit actually is. |
I found the discussion I was referring to: |
We generally assume that CIDs are small and cheap to handle. CIDs are designed to be identifiers and inline CIDs are designed to allow inlining objects that are smaller than the identifier. Keeping this assumption valid is extremely important for performance and usability.
Last time we discussed this, we set the default limit to 32 bytes to be safe and marked the relevant options in go-ipfs as experimental. Really, it sounds like the need here is for a way to pass around blocks with type information, right? We may need to distinguish between the two cases. |
Afaik this part got entirely nixed, and the current implementation will not support over-long CIDs ( /cc @lidel ) |
Where/how would we enforce the limit? We’ve had very similar threads about block size limits and ended up punting any hard limits to the network and storage layer, which in this case isn’t really an option since they rarely decode the block data. Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CID’s correctly and will hand the Link/CID type off to the storage layer to get as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase when it’s handled at all. This is a good time for this discussion given that inline block support is still under development, but I want to make sure that we’re setting the right expectations about where this might land. |
When decoding/validating CIDs?
Actual support in block stores is just an optimization. Inline CIDs are still usable as normal CIDs. |
We identified two cases: Data is smaller than the hash would beI'd call this one the original use case. There seems to be agreement that for this use case, there should be a really low limit, in the bytes, rather than KiB range. Inlining blocks/blocks with type informationThis is the Peergos use case. It's more of an optimization. Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of). |
It seems clear to me that there is a breaking change coming here. So, for what it's worth, I've come up with a backwards compatible solution and we're stopping using identity multihashes to inline small blocks in Peergos now. We can't migrate existing users (we don't have their keys), but I've realised that those that are on our servers are actually ok now because we are using the S3 datastore and managing pins ourselves (indeed if they're logging in through our server they're not even using ipfs at all). Any new users, new data written, or old data modified will not use identity multihashes. I think we can write a gradual migration that runs when they log in. It's painful, but the risk is too high for us. The largest identity multihash we use now is 36 bytes, for public keys (which includes the type of key and key material all encoded as cbor). And we now enforce this as a hard limit for new data. In the future we'd like to use identity multihashes for CSIDH public keys (64 bytes + multikey header), but that is a future discussion. |
multihash is a general purpose standard with general purpose libraries. Given how young the project is, we should assume that the use cases we currently understand are not a complete set and we need to make sure we stay open and accessible to being used for things in the future that we haven’t even thought of yet. With that in mind, I don’t think that we should:
If there’s a universal reason to have a limit on the size of a multihash that we’re confident is always going to be true, then we should adopt it, but that’s not at all what I’m seeing. If, as we already know, IPFS wants to use CID’s for subdomains and therefor needs to enforce a size limit on CID’s which in effect limits the size of a multihash, that’s fine, but that’s IPFS’s decision to make and their limit to enforce. That doesn’t belong in the core of multihash because it’s not universally representative of all the use cases someone might build on multihash. |
I think we should exactly do that. This will prevent systems to randomly blow up when some implementation impose arbitrary limitations. The nice thing about having a limit in multihash is, that you can highly optimize it. You would always know the upper bound of the supported hashes, hence e.g. do things with stack allocations only. This is not possible if you want to support the Identity Hash with maximum compatibility, which would mean 8EiB. |
Then don’t “support the Identity Hash with maximum compatibility” ;) Users and implementations are free to make domain specific decisions about these limits, the right decision for one user will not be the same for another. It’s not the job of the underlying primitive to make these decisions on your behalf because we don’t know what each user’s requirements are. Look at the block limit, to my knowledge only one transport has a real block limit and yet pretty much every user imposes block limits at half the current transport limit because we called it out as a good practice. It’s not a hard requirement in the spec and it’s not enforced by our codec libraries, but it’s a functioning limit everywhere that it matters. I’m not saying we shouldn’t define good practices, and even document what we think is a reasonable target limit for multihash, but we shouldn’t impose that limit in these libraries at that layer or call it out in the specification as a hard requirement. If @ianopolous wants to have big multihashes in his CID’s, he shouldn’t have an issue at the multihash layer, even if he will have issues at the IPFS layer. In the same way that I can create 5MB blocks for a one-off use case knowing that if it ever needs to be used in Bitswap it’s going to break. |
I struggle to see how these inline use cases would exist in You can “inline” node data into the block using any codec that supports the full IPLD data model without hacking it into the CID. I can’t see the utility here other than “we forgot to make this part of the data structure a union” and it seems like the right thing to do there would be to fix the data structure to support that because it’s a lot more complex to deal with data that has been inlined into the multihash. I understand that in the case of It’s not that inlining data isn’t a common and necessary feature, it is, that’s why we fully considered it in But, across a lot of our code, data inlined into the multihash throws a wrench in our layer model and is difficult to support across different Block, Link, and storage interfaces. Most code thinks of a |
why do you say that? Nothing in e.g. ipld-prime land would change:
|
Everything we do is dag-cbor - this has nothing to do with protobuf or unixfs. It is much much more elegant to do it our way (although elegance wasn't our motivation, speed was) because then the same object class always maps to the same ipld structure. The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld strutures.
It was trivial to support this for us (4 lines of code globally). There's nothing "manual" to it. |
That’s specific to Go, where the link loader is an abstraction between the storage layer and the decoded node layer. We don’t have that in every language, often the node layer just talks directly to the storage layer, which means every storage API needs to handle this or every line that asks for data by CID needs to handle this. Also, @warpfork will need to weigh in, but I recall him mentioning that there are plenty of things in go-ipld-prime that won’t work well when inlining data this way.
The solution we spent considerable time working through to this problem is unions (mostly kinded unions using Link as a kind). It’s a core feature of IPLD Schemas and translates nicely into every programming language and all the abstractions we’ve built. It’s problematic to have multiple approaches to inlining data and a kinded union provides a much cleaner approach that keeps the type differences clear to everyone. It sounds like you actually want to blur the line a bit on the type differences so I can see how that approach with be more attractive, but as we build out generic libraries it’s rather difficult to have a single type mean very different things. That said, we’re not going to break or disallow anything that is valid CID/multihash, we just may not have a very nice interface for you to use when you inline data this way, which you probably don’t care about since you have your own libraries ;) And as I’ve already stated, I’m rather opposed to setting a hard limit on multihash size in the specs or core implementations. Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit, none of which are an issue if you were to take the kinded union approach instead. |
I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land". To put it differently: from link-traversal perspective there is no practical difference between:
We currently support both. The proposal is to limit only one of them, on account of one of them being special. How do these 2 examples differ? |
Once you recognize that links in a node graph are transparently traversed, that the link is resolved to a node that replaces the link representation to become the node representation for that property in the parent node, they are functionally equivalent. Both put the node data in the same place and stored in the same block. Conceptually, this never exists in the decoded node graph:
Instead, this is what happens:
You can observe this in our pathing, where named properties that are links get resolved to their decoded node value. There’s actually no way to return a link from a fully resolved path. let value = 1234
let link = Link( value )
{ property: link } If you resolve the path |
That's exactly the issue, when we don't specify a limit. Your application would work on one implementation, but not by default on some other. Finding out that one of your identity hashes is to long, sounds like a very hard to find bug if you are not aware that there might be a limit. The option might be configurable, but it might not even be configurable at runtime, but at compile time only. Having a limit makes building systems that work everywhere easier, not having a limit makes it harder with little benefit. Storing data in Multihash to me sounds wrong, that is not what Multihash means to me. Though it's kind of a neat hack, so if we want to support large data (still with a limit, but a high one) stored within a Multihash, I suggest we introduce a new codec for that. This way codecs can still support the "the hash of my data is bigger than the data itself" use case, while not being forced to also support "i store data in my multihash" use case. This way also the error reporting will be clear, as you are now see that your codec is not supported instead of having it fail in weird ways. |
I think a lot of the important things have already been said here, but I've been called out, so I guess I feel I ought to weigh in on the record, heh. I'll mostly just re-highlight things that have already been said that I agree with and want to boost, though:
@ianopolous -- I'd be happy to talk to you more synchronously about this if you'd like, but in essence,
... that sounds less like a "workaround" and more like "exactly the right thing" to me :) Maybe there's some different way to organize the type definitions in your language of choice that would make it more natural? Dunno; I'd be willing to look at it with you though, if you'd like. I'll also say: goodness knows golang hasn't been making it exactly easy on me to represent unions either! But it's been a logically sound path to pursue, even when my host programming language hasn't made it frictionless. So far, every time push has come to shove, I've been very happy with the outcomes stemming from our pursuit of unions. |
(Perhaps that comment would be easier to read if we introduced some term other than "inline" for when we put data in the same block rather than a separate block -- I used it describe that general practice in the same comment as discussing "inline CIDs", and that's probably confusing. Forgive me, reader. A better phrasing has not, at this moment, yet occurred to me.) |
To follow up a little more concretely on that allusion to "we have a way to talk about links vs This is a schema snippet that we could use to describe this common scenario:
A block containing one object matching the At no point were inline CIDs I find that this is a very straightforward way to describe this situation: and it works fine with arbitrarily complex structures of data; it works purely in ways that are easy to describe in terms of the Data Model (e.g., without needing to discuss using different variants of CIDs); and because of this simplicity, I think this is generally the sort of approach I would recommend essentially all new code to take in preference to inline CIDs, if at all possible. I've used the schema syntax here only to clarify and describe. It is not necessary to use schemas to do this; one can simply construct data and walk over the data model according to this convention. I don't actually know what reasons there would be to prefer using inline CIDs over using this much simpler model of "embed if you want". If there are some very specific situations that really require using a CID for ${external opaque reason}, maybe that's something we should document in a short list of known situations? My suspicion is that list is going to be very short and have the general feeling of being enumerating "exception rather than the rule". |
I can explain our case in detail if it helps. Note that we have a work around/"exactly the right thing to do" as mentioned above. One of our fundamental objects has a field which is ciphertext, so a structureless byte[], which is limited to 5 MiB. This thing can be either a directory or a file, and indeed we explicitly hide this from the ipld layer (you have to decrypt to decide which). We represent this as a class with a field which is a list of cids. Whenever we want the actual ciphertext we just pass the list of cids to the datastore and get back the results (identity hash or not). However, as I mentioned above, this thing can represent a directory, and this means that when we are traversing a path we must retrieve many of these objects, decrypt the ciphertext and recurse. The critical point in our model is that we aren't assuming the datastore is local, normally it is on a remote server. This means that inlining the directory (or small file) data is critical to speed for us because it results in many fewer network round trips/ DHT retrievals. I'm fine with not using identity mulithashes here, as we're about to migrate towards, but it does result in more complicated code, and I wish there had been a limit or indication that these were meant to be < 100 bytes when it was released in go-ipfs. On the general topic of having a limit. I am very strongly pro having a well defined limit. The reason being that without such a limit we have a system with a primitive that is basically undefined. If some system imposes limits then the end result is a fragmented ecosystem of applications and language implementations that can't talk to each other. |
@ianopolous so to make sure I understood you right: basically your use case is point 3 here: #130 (comment) ? Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one? |
@ribasushi Yep that's right. I should also add that the reason it is 4096 + 16 is that we pad all plaintext to a multiple of 4096 before encryption to protect the metadata around size. (And the 16 is the encryption overhead) |
This is true of pretty much every protocol. TCP, UDP, HTTP, etc, there’s no size limit in the protocol specification for the total data transferred, but every service provider and implementation sets one. These limits don’t seem to negate the benefits of agreeing on common protocols and clients learn to live within reasonable limits the ecosystem of service providers have set. As an example: I have a script that does GraphQL queries to GitHub’s service and if the request takes too long the gateway kills the connection even though the query was well within the rate limit GraphQL and even their HTTP service set. Service limitations are application and provider specific, and they are applied to all the protocols you touch, we can’t enforce them for everyone or even hypothesize about what all the use cases are. For a lot of people, setting a block size limit of 1mb solves any concerns they might have about large CID’s as a side effect. For others, maybe not. I’m very interested in recommending a reasonable size limit and would expect many consumers to adopt it (similar to how we handled block size limits) but to set a hard limit in the standard is too much for me. |
A surprising number of hash functions are of variable size so it’s not really practical to do this. Something @ribasushi brought up a few months back is that we need to put a size limit in our code for pretty much every hash function. Since these recommendations vary per hash function, we may want to capture this in another table, or expand the multicodec table, because the expected/recommended max size for most hashing functions doesn’t vary by programming language. We haven’t implemented this yet, but we intend to. |
What part is impractical about doing this? We may have a few (i.e. 1-4) hash functions, but the larger their sizes are the more prefix bits we can give them. Our key space is effectively unlimited if we're not focusing on giving out prefixes that are 1-2 bytes. We gave away 100s of spots to other multihashes in the past many of which are probably unused (looking at you blake2s-72). What's a giving out a few per hash function where the prefix length assigned increases with minimum multihash size?
If that needs to exist in code and we want code from different projects to be compatible then the code requires a spec
Then let's just assign the recommendations to a codec and we're done.
How is this different then my proposal? |
Every table entry has a cost as it represents another barrier to compatibility between implementations, as each implementation must be configured with a hashing function that matches the identifier. If a hashing function has a variable We cannot expect every multiformats implementation to have support for every hash function. We also cannot arbitrarily limit the number of new hash function entries in order to try and reduce proliferation because people who are using obscure hash functions have grounded reasons to do so and will likely sacrifice multihash compatibility if necessary. The only thing we can do is allow for optionality where we can, which is in the length. As far as I can tell that’s the only place we can, so we should. |
I’m recommending that we add a column to the table to capture recommended max sizes. What I think you’re recommending is adding new entires for each hash+length. |
How? If you're talking about library compatibility, such as between If you're talking about application layer compatibility then IMO having a specified length instead of an apprehensive documentation note is much better. Without it the probability that two multihash based applications end up incompatible with subtle bugs seems quite high. There's another benefit to using the multicodec table. When a new user shows up and says "I see you support the identity multihash up until 512 bits, but I'd like 10k please give me a codec" we can say "sure we're happy to assign a codec, but there's a reason nobody's using it so far do you want to discuss your use case". |
@mikeal @vmx I think what is @aschmahmann proposed is actually quite practical ( even though I dislike it for a reason detailed below ), if we attach sizing to multiformats table position. Just like unicode has predefined "planes", we could ( without breaking too much ) define the following:
I really dislike this because we are limiting stuff that was previously open, but if this will make all sides moderately happy: I think it's a compromise worth considering. What I take issue with however is:
This implies too strong "table ownership" dynamics. We are stewards at best. If a user wants to take the cost of a large CID, ad the penalty of a large multihash code: they should get one, no questions asked (asking as curiosity: sure, but not in "justify yourself" terms ). Unless the conversation is shifting from "we want to give implementers reasonable assumptions to work with" to the more authoritarian "I think doing X is bad, and I don't want to see anyone do it using my table". If we are having the latter conversation: we already lost our way. |
I don't know if we really need to define that as anything more strongly then a general policy (as opposed to a rule we could encode in a lang-multihash library). However, the general principle of leveraging the size of the multicodec table to allow specifying as many things as we want is indeed what I was getting at.
Fair enough. I agree we should assign codecs if people ask for them without needing a "reason" (as long as they're not doing things like asking for 1000 low bit codecs). However, as people who are involved in the project it may serve as a useful heads up to let a project know that while they can happily use a new codec what some of the tradeoffs involved might be (especially if it's as simple a change as increasing the hash length). |
@mikeal I'm against this. This would lead to discussions like this one. Which recommended size would you pick for the identify hash. It might also make implementations more difficult. I would then try to meet those recommendation. If we just say there are limits, I can keep it simple and e.g. limit all my hashes to a single maximum value. That's easy to explain and easy to reason about. |
@vmx we do not want to impose a cap for SHA2-256 of anything other than 256, and 512 for SHA2-512, and so on. Some hash functions have a known fixed length that should be imposed and some are variable, that’s why this needs to be another column, because it’s a bigger problem than just this generic max size. |
Wow, long thread. I just want to add that the 100 byte limit I imposed on ScanLinks in cbor-gen (here) is fairly arbitrary, and aimed at protecting certain performance critical code from overallocations (note: the const is only used inside of ScanLinks, the rest of the codebase imposes a limit of 512 on cids, another fairly arbitrary limit to help protect against DoS). I'm happy to change these limits (or make them configurable per use case) as people see fit. |
Yes I know. Though I guess it should follow the same logic as the identity hash, that we cannot really make good judgements about those limits. In case we want per hash recommendations (I don't), I wouldn't add another column (as it also doesn't apply to non Multihash Multicodecs), but just add it as a comments. |
Re-pinging all participants, especially with the newly attained understanding of where identity CIDs really shine. |
@ribasushi Can you please elaborate on where identity CIDs really shine. Side-note: I'm still not sure how far the concept of CIDs should be stretched. I like the simplicity of having an identifier for content-addressed things. To me "identity CIDs (I'd call them "inline CIDs") is a different concept. It makes use of multiformats (which is great), but are they really CIDs anymore? In order to make that work, we not even need to stretch the concept of CIDs (by not addressing the content, but inlining it), but also the one of Multihashes. They are (according to the current spec) about "cryptographic hash functions", which the identify function certainly isn't (e.g. it misses the pre-image resistance). So perhaps there should be a new multicodec called "IID", an "Inline Identifier", which is just: |
We have recommendations for block size limits but ultimately it’s up to different protocol implementations to enforce them. I don’t see a compelling reason to recommend or try to enforce any limits beyond that. Identity CIDs live in blocks and can accumulate quickly to exhaust block limits. Given that we have no real enforcement mechanism let’s just leave it as is. |
Adding to the discussion, why limit the size of identity multihashes when you could limit the size of the things that contain identity multihashes? Here's my thinking:
Beyond DNS, DAGs, and IPFS records, identity multihashes don't see much use, and from there the size restrictions should be determined by the developer that intends to use them outside these typical set of use-cases. As for developers making multihash parsers, the spec should include a minimum identity multihash size they must support (maybe make it equal to the largest non-identity multihash available?), however, it should be up to the developer to decide on an upper limit depending on the performance/security considerations that are being taken for the particular implementation. What do you think? |
Hey, I forgot about this discussion (thanks for reviving it @Winterhuman) but I've had my head in identity CIDs of late so this is relevant. We're dealing with some Filecoin retrieval issues involving identity CIDs which have been made by various tooling, most notably Lotus itself when creating UnixFS data for you with an I've seen this code copied in a couple of places, so we must have a cultural meme that's stuck with some folks that identity CIDs are a good idea when the blocks are <= 126 bytes. So this is happening, in the wild, and any limit should at least take that into account. The retrieval problem with Filecoin is interesting because it involves DAG root CID that are identities and are skipped by various parts of the storage and indexing layer because they are identity CIDs. But because they're overlooked, when a user shows up wanting to retrieve a DAG that starts as an identity CID (typically with multiple links in it), the retrieval fails because it can't find a stored piece with that CID in it. So a top-layer solution (there are other solutions to be explored focused on indexing, but for now we just want to make these retrievals work) is to accept identity CIDs and handle them as if they are a request for a piece containing all of the links in that CID and then start a retrieval based on that from a matching piece. But in the process of implementing that we have to make some decisions about limits, because we're not just dealing with the <=126 byte case, we're essentially opening up another retrieval mode where a user can craft their own identity CID containing lots of links and be requesting that a storage provider find a piece that happens to have all of those links in it. Having it open ended isn't really ideal because that could be a lot of work and from a client perspective just making a query is cheap and we don't want to make it expensive for storage providers. I've come up with some limits, but they're kind of arbitrary, but they at least feel like a compromise between "don't make me do too much work" and "this could be a useful mode for multi-DAG or just multi-block retrieval": filecoin-project/go-fil-markets#747
You can pack 32 average sized CIDv1s into a dag-cbor identity CID for around 1320 bytes. Too much? Too little? 🤷 But this is a specific path, so the limits here don't necessarily reflect what a reasonable limit might be for other uses. |
@rvagg It's definitely a use-case to keep in mind, and it also helps demonstrate one of my points. It would be much easier to just give a minimum max size of identity multihash that all multihash parsers must support than giving a definitive maximum size. For instance, your use-case is using identity multihashes for DAG roots, and so it has a large limit, however, something like Kubo would almost certainly support only up to the largest non-identity multihash to mitigate attack vectors. Just to discuss your idea further, you want to impose a 32 link maximum per identity CID, however, what would stop people from using identity CIDs as links, with further links in them, to go past the 32 link limit? I guess the 2048 bytes maximum would prevent it going further, but, it's something I wanted to bring up. |
Well, I added support for recursive identity CIDs here: filecoin-project/go-fil-markets@ffdf335 Because the recursive has to be inside the parent identity CID, you hit the byte limit pretty quick, and the total 32-link max is taken over the parent + recursives it contains. Any identity CID that an identity CID contains must be included in it, so the case is covered fairly neatly here. But to be clear - recursive identity CIDs are kind of insane, and even supporting them seems a little insane. There's some additional weirdness when it comes to Filecoin re identity CIDs that make it a tricky topic, mainly focused around the relationship between a DAG, the blocks in that DAG and the "piece" it's stored in. When you have an opaque blockstore that just returns blocks to you when you ask for them then a lot of those problems go away. But with Filecoin there are other factors that come in to play, so it's not opaque - do I have to unseal the piece to get you this block/DAG? do I have to charge you to send you this block/DAG? And identity CIDs essentially become identifiers—even if Filecoin were to just treat them as free retrievals, (a) do we want to force storage providers to answer random queries for "give me this identity CID"? and (b) what about DAGs that hang off such identity CIDs when you want to do a graph-based retrieval or when you're using an identity CID to identify some piece that you want to do a complete piece retrieval of. Ugh. It's a mess and leaves me believing identity CIDs were a mistake. |
This is objectively incorrect: any competently written dag assembler can naturally produce recursively-stacked identity CIDs: the input into the "decision function" of whether to wrap an identity is nothing but the size of the "block" based on the immediate linklist.
None of your current troubles are specific to identity CIDs as a concept: everything stems from the fact that folks were/are trying to be clever about handling them in a special way through the stack. That is: identity CIDs are not special in any way. Treating them as such outside of transport is the actual grave mistake we should stop perpetuating. |
I do have one question about inline CIDs. What's the difference between setting a link to
Is the difference practical? Conceptual? It's the one thing I've not quite understood when reading the whole conversation |
@Winterhuman in your example you have what looks like a dag-pb conforming block printed in dag-json. One reason to do bytes in a CID is that dag-pb is so strict with what it can include, it only has one place for bytes, the But outside of dag-pb, the question still stands—why use an identity CID to smuggle bytes if you could just include bytes? I suppose it comes down to your desired schema, like with dag-pb, perhaps you need something to be a link in that position, but instead of linking to a tiny raw block, you just inline it and save some space. The same could be true with other data forms too—maybe your application's schema wants a link to a block containing a list, but you have an empty list, encoding that as a dag-cbor empty list would be a very small identity CID and you'd end up saving space doing so (even with the de-dupe you might get from needing to do it multiple times). There are other cases too, like this one that's always slightly puzzled me: https://github.com/filecoin-project/specs-actors/blob/d8d9867f68a3c299295efdc6d1b3421c9b63df57/actors/builtin/codes.go - the identifiers for the builtin actors for Filecoin are identity CIDs containing the versioned name of the actor as bytes in a |
This post is for extrapolation of one of the reasons why I think we should discourage the use of identity CIDs in Filecoin (so I can link to this from a lotus discussion). This is mainly relevant for online deal flow, and mainly influenced by the historical precedent we set by including inline CIDs as their own block entries in CARs, although there are other reasons why it's not necessarily a great idea. When we include CIDs as separate blocks in CARs, the block data we're inlining shows up 3 times: in the original block with the link, then as the CID for a CAR section and then as the data for that same CAR section. It'll show up another time for a CARv2 index, which will be generated for SPs actively serving retrievals, but I've not factored that in below. We get de-duplication efficiencies if multiple blocks use the same inline CID but we'd get similar efficiencies if we were linking to the same block anyway, and beyond the size of a standard CID (~36 bytes) we rapidly end up losing any benefit. Ignoring de-duplication effects, the "savings" can be visualised like this (only CIDv1), going up to the 126 bytes that Lotus will currently inline: Storage cost increases ~linearly for storing the bytes as a separate linked block, with the linking CID and the CID:bytes CAR entry; i.e. the overhead is ~2 CIDs. For identity CIDs the increase is ~3x, with an overhead of 9 bytes. Break-even is at 32 bytes. |
Is anyone in this thread going to squeal if we put a hard limit in both js-multiformats and go-cid, and also a suggestion in the spec, that would error if they are asked to decode (or encode) an identity CID greater than 2048 bytes? Try and ignore for a moment that special-casing one specific hash function is a bit weird, would you actually care of this was done? If not, what alternative limit would make you happier? |
As a frequent user of |
Yep no problem. We've long since migrated off identity hashes for anything but small public keys, and we hard limit them to 36 bytes there. |
As I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any limits. From an optimization perspective (this is why it came up in Rust), but also from a security perspective I think it would make sense to specify an upper bound for its size.
I personally would take a quite low limit which is similar to what current hash functions have as length. So perhaps something around 64 bytes?
The text was updated successfully, but these errors were encountered: