Verifiable Deal Aggregation #283

Kubuxu · 2022-01-31T21:36:40Z

Kubuxu
Jan 31, 2022
Collaborator

With deal aggregators emerging and gaining popularity, it is much easier for clients to make deals for small pieces of data.
This is great for ecosystem growth, but unfortunately it causes clients to lose one important capability:
the ability to verify that their data is stored inside a sector.

It is essential to recognise the importance of this capability. Verifiability is one of the core properties of the web3 ecosystem, whether it serves a single user or enables the composition of services. Without Verifiable Deal Aggregation, possible use cases like data availability for L2 protocols, storing NFTs or small websites cannot evolve beyond using trusted aggregators.

Below I propose two solutions for Verifiable Deal Aggregation. One of them is simpler but constrains sub-deal sizes and alignment. The other is more complex in design and execution but allows for more flexible sub-deal sizes.

Simple Verifiable Deal Aggregation

As the name suggests, this protocol is quite simple, but the sub-deal size is constrained to be a power of two after padding.

Protocol:

1. The Client computes the PieceCID of their data segment (ClientPCID).
2. The Client asks the Deal Aggregator to store their data.
3. The Deal Aggregator packs clients' sub-deals into a deal, according to sector packing rules (power of two packing).
4. The Deal Aggregator makes a deal with a Storage Provider, and the SP publishes the deal.
5. The Deal Aggregator shares with the Client the DealID and the SubPieceInfos for the PieceCID in which the Client's data was included.
6. The Client verifies that:
   a. Their ClientPCID is included in the SubPieceInfos with the correct size.
   b. Computing the PieceCID out of the SubPieceInfos results in the same PieceCID as in the deal under the given DealID.
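
To make the two client-side checks concrete, here is a minimal Python sketch. It is not the actual implementation: the `SubPieceInfo` record and all function names are hypothetical, the node hash (SHA-256 with the two top bits of the last byte cleared) is an assumption about how piece commitments are combined, and the sub-pieces are assumed to be listed in sector order, already aligned and gap-free, so no zero-padding pieces have to be inserted.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SubPieceInfo:
    """Hypothetical record: a sub-piece commitment and its padded size in bytes."""
    commitment: bytes  # 32-byte digest carried inside a PieceCID
    size: int          # padded size, a power of two


def merge(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into a parent (assumed: SHA-256, top two bits cleared)."""
    digest = bytearray(hashlib.sha256(left + right).digest())
    digest[31] &= 0x3F
    return bytes(digest)


def compute_piece_commitment(infos: list[SubPieceInfo]) -> SubPieceInfo:
    """Fold sub-pieces, listed in sector order, into one aggregate commitment."""
    stack: list[SubPieceInfo] = []
    for info in infos:
        stack.append(info)
        # Merge for as long as the two topmost subtrees have equal size.
        while len(stack) >= 2 and stack[-1].size == stack[-2].size:
            right = stack.pop()
            left = stack.pop()
            parent = merge(left.commitment, right.commitment)
            stack.append(SubPieceInfo(parent, left.size * 2))
    if len(stack) != 1:
        raise ValueError("sub-pieces do not fold into a single power-of-two piece")
    return stack[0]


def client_accepts(client: SubPieceInfo,
                   infos: list[SubPieceInfo],
                   deal_commitment: bytes) -> bool:
    """Check inclusion (commitment and size), then recompute the deal commitment."""
    if client not in infos:
        return False
    return compute_piece_commitment(infos).commitment == deal_commitment
```

In this sketch `deal_commitment` stands for the digest of the PieceCID that the client reads from the published deal; how it is fetched for a given DealID is left out.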

This protocol reuses facilities for computing UnsealedSectorCID out of PieceCIDs, which, while simple, has one crucial drawback.
Sizes of deals (and thus sub-deals) in this framework are limited to 32*2^{N} bytes, where N is an integer between 2 and 30 inclusive, for data size after padding (128/127 of the unpadded data size).
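
As a rough illustration of this size constraint, the Python sketch below checks whether a padded size is allowed and picks the smallest allowed size for a given amount of unpadded data. The function names are made up for this example, and it assumes the 128/127 expansion applies uniformly, i.e. a padded piece of size S carries at most S/128*127 bytes of client data.

```python
MIN_EXP, MAX_EXP = 2, 30  # N ranges over 2..30 inclusive


def is_valid_padded_size(size: int) -> bool:
    """True if size == 32 * 2**N bytes for some N in [2, 30]."""
    if size < 32 or size % 32 != 0:
        return False
    units = size // 32
    is_power_of_two = (units & (units - 1)) == 0
    return is_power_of_two and MIN_EXP <= units.bit_length() - 1 <= MAX_EXP


def padded_piece_size(unpadded: int) -> int:
    """Smallest valid padded size whose payload capacity fits the data."""
    if unpadded <= 0:
        raise ValueError("unpadded size must be positive")
    for n in range(MIN_EXP, MAX_EXP + 1):
        size = 32 * 2**n
        capacity = size // 128 * 127  # assumed payload capacity after padding
        if capacity >= unpadded:
            return size
    raise ValueError("data does not fit in the largest piece size")
```

For example, 127 bytes of data fit in a 128-byte piece, while 128 bytes already need a 256-byte piece.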

The ComputeUnsealedSectorCID (or equivalent) is fast enough to be exposed on-chain, and the amount of data needed to prove sub-deal inclusion on-chain is minimal (with optimisation, the maximum is 32*(30-log2(SizeOfDeal/32)) and the expected size would be in the area of 128-512B).
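
The proof-size arithmetic can be sketched as follows, assuming 32-byte tree nodes and reading the maximum above as the path from a sub-deal up to the root of a full 32GiB piece. `inclusion_proof_bytes` is a hypothetical helper, not an existing API.

```python
NODE_BYTES = 32
MAX_PIECE = 32 * 2**30  # 32 GiB, the largest padded size


def inclusion_proof_bytes(sub_size: int, containing_size: int = MAX_PIECE) -> int:
    """Bytes of sibling hashes linking a sub-piece to the root of its container."""
    for value in (sub_size, containing_size):
        if value <= 0 or value & (value - 1):
            raise ValueError("sizes must be powers of two")
    if sub_size > containing_size:
        raise ValueError("sub-piece is larger than its container")
    depth = (containing_size // sub_size).bit_length() - 1  # log2 of the ratio
    return NODE_BYTES * depth
```

With these assumptions a 128-byte sub-deal in a 32GiB piece needs 32*(30-log2(128/32)) = 896 bytes, and a sub-deal that is 1/16 of its containing deal needs 128 bytes.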

Adaptive Verifiable Deal Aggregation

The main draw of this scheme is that the deals don't have to be power-of-2 aligned, increasing the flexibility of deal aggregation at the price of increased verification cost. The verification cost and proof size scale with the misalignment of the sub-deal data.

The misalignment of M can be explained as follows:

M=0, data is aligned to its power-of-two size.
M=1, data is misaligned by half of its power-of-two size: a 400MiB deal, rounded up to 512MiB, misaligned by 256MiB (at offset x * 256MiB)
M=2, data is misaligned by a quarter of its power-of-two size: a 512MiB deal, misaligned by 128MiB (at offset x * 128MiB)
M=M, data is misaligned by Pow2Size/(2^M), at offset x * Pow2Size/(2^M)

The effective size of the deal is also rounded up to a multiple of Pow2Size/(2^M). For example, a 300MiB deal, which would normally be rounded up and utilise 512MiB of space, will with M=2 utilise 384MiB of sector space, leaving room to insert right after it a 128MiB deal, another 512MiB deal with M=2, or a 1GiB deal with M=3.
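
The alignment and rounding rules can be written down as a small Python sketch. The helper names are hypothetical, and "granule" is just a label used here for Pow2Size/(2^M).

```python
MIB = 1 << 20


def pow2_size(size: int) -> int:
    """Round a size up to the next power of two."""
    return 1 << (size - 1).bit_length()


def granule(size: int, m: int) -> int:
    """Alignment unit Pow2Size/(2^M) for a deal of the given size."""
    g = pow2_size(size) >> m
    if g == 0:
        raise ValueError("M is too large for this deal size")
    return g


def effective_size(size: int, m: int) -> int:
    """Sector space used: size rounded up to a multiple of the granule."""
    g = granule(size, m)
    return -(-size // g) * g  # ceiling division, then scale back


def offset_allowed(offset: int, size: int, m: int) -> bool:
    """A deal may start at any offset that is a multiple of its granule."""
    return offset % granule(size, m) == 0
```

With these definitions `effective_size(300 * MIB, 2)` is 384MiB, and offset 384MiB is allowed for a 128MiB deal with M=0, a 512MiB deal with M=2 and a 1GiB deal with M=3, matching the example.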

Higher misalignment values require more inclusion proofs to prove the misaligned data. The number of required inclusion proofs is 2^M+(0 or 1), with half of them M+log2(DealSize/SubDealSize) long and half M long.

This adaptive property means that:

deals that are meant to be verified later on-chain can use M=0, achieving the same verification cost as Simple Verifiable Deal Aggregation
deals that clients want to verify off-chain can use a higher M value, such as M=3 or M=4, chosen so that the size and cost of verifying the inclusion proofs are not too high
deals that don't need to be verified can use any alignment

This protocol version allows the best of both worlds while possibly enabling all aggregated deals to be verifiable. The simple protocol has less chance of adoption for all aggregate deals, as the expected deal size overhead due to rounding to the next power of two is 25%. This overhead is halved for adaptive verifiable deal aggregation with M=1, or reduced to 6.25% for M=2.

cc @nicola @ribasushi @anorth

EDIT 1: The Adaptive proof cost changed from 4^M to 2^M+(0 or 1).

mikeal · 2022-01-31T21:45:08Z

mikeal
Jan 31, 2022

this looks incredible, thanks @Kubuxu!

this would really help us speed up our write pipelines in *.storage :)

0 replies

aschmahmann · 2022-01-31T22:29:10Z

aschmahmann
Jan 31, 2022

It seems like in order for this to work you need some zeroes lying around in here to pad things out to the correct M values.

If so, does this require aggregators to ship data to SPs without using an IPLD transport, or require some sideband communication detailing how to transform the IPLD data, or am I missing something?

My understanding of the current system from https://spec.filecoin.io/#section-systems.filecoin_files.piece is that the way data is currently ingested into SPs in the "online deal flow" is:

1. An aggregator gets a set of DAGs and combines them into a bigger DAG that when serialized as a CARv1 file will be <32GiB
2. The aggregator computes the CommP over the serialized CARv1 of the bigger DAG with some zeroes padded at the end to make it the same size as a sector (e.g. 32GiB)
3. The aggregator sets up a deal with an SP to store some data represented by the CommP
4. The aggregator sends the bigger DAG to the SP via GraphSync
5. The SP reassembles the DAG received via GraphSync into a CARv1 with some zeroes padded at the end to make it the same size as a sector (e.g. 32GiB)
   a. The SP also creates an index of where in the CARv1 each block of data is so that it can be retrieved by end users looking to download IPLD DAGs
6. The SP verifies the data matches the CommP it promised to store
7. Data is sealed and committed to on chain

If we need to insert zeroes in the middle of the sector to get the correct deal padding for each DAG this implies that either:

1. The aggregators need to send the sector bytes rather than an IPLD DAG to the SP
2. The aggregators and SPs need a (new/another) way of signaling a consistent way for the set of small "deal" DAGs to be encoded into a sector

Note that while option 1 sounds nice and easy, we'd still likely need some standard format for the sectors so that SPs could create an index mapping of the blocks being stored and make them retrievable for end users who want to request some IPLD DAG stored by the SP.

3 replies

Kubuxu Jan 31, 2022
Collaborator Author

This would be one way to do it, but I don't think it is the only way.
Zero padding is the default, so if the reassembled CAR is shorter than the 32GiB required, it isn't an issue.
What is a problem is that the aggregator needs to send multiple deals within one CAR while preserving parts of sub-CARs at given alignments.

One solution to that would be to allow multiple CARs to be placed within a single deal. It would require handling on both the Aggregator and SP side (an upgrade to go-fil-markets). In essence, the aggregator would instruct the SP: for this deal, fetch CidA and place it at offset 0, CidB at offset 128MiB, and CidC at offset 384MiB, all within one deal of size 640MiB.
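
As a rough sketch of what such an instruction could look like on the SP side, the Python below checks a list of placements for overlap and reports the ranges that would be left as zero padding. The `Placement` record and `plan_padding` are hypothetical and not part of go-fil-markets, and the CAR sizes used in the usage note are invented for illustration, since only the offsets are given above.

```python
from dataclasses import dataclass

MIB = 1 << 20


@dataclass(frozen=True)
class Placement:
    root: str    # root CID of the sub-deal CAR (placeholder strings in this sketch)
    offset: int  # byte offset within the deal
    size: int    # byte length of the serialized CAR (hypothetical values)


def plan_padding(deal_size: int, placements: list[Placement]) -> list[tuple[int, int]]:
    """Validate the layout and return (offset, length) ranges left as zeroes."""
    gaps: list[tuple[int, int]] = []
    cursor = 0
    for p in sorted(placements, key=lambda item: item.offset):
        if p.offset < cursor:
            raise ValueError(f"{p.root} overlaps the previous placement")
        if p.offset > cursor:
            gaps.append((cursor, p.offset - cursor))
        cursor = p.offset + p.size
    if cursor > deal_size:
        raise ValueError("placements exceed the deal size")
    if cursor < deal_size:
        gaps.append((cursor, deal_size - cursor))
    return gaps
```

For instance, with a 100MiB CAR for CidA at offset 0, a 256MiB CAR for CidB at 128MiB and a 200MiB CAR for CidC at 384MiB in a 640MiB deal, the zero-filled ranges are 28MiB starting at 100MiB and 56MiB starting at 584MiB.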

Another solution would be to tactically add blocks within the CAR/DAG for padding/alignment purposes. The traversal order is depth-first, so the primary issue is that you need to somehow know the length of the padding early on, because the root node and traversal nodes are already written by the point you get to the padding.

The simplest alternative, as you mentioned, would be to send sector bytes, but I think that is off the table for this use case.

Kubuxu Jan 31, 2022
Collaborator Author

On the other hand, it might be a perfect use case for a multi-root CAR: have each sub-deal be a separate CAR root, followed by a root for padding. One issue might be that CAR is underspecified when it comes to duplicate blocks. For the padding, that is simple to resolve by adding an auto-incrementing counter to the padding blocks, but what if there is data duplication between clients?

aschmahmann Jan 31, 2022

Yep, none of these are deal breakers at all and we should absolutely be able to do these sorts of proofs. It'd be a shame if we didn't feel we could do merkle proofs on merkle trees 😄.

I mostly wanted to flag that, no matter how you do this, the way in which people turn DAGs/sets of DAGs into sectors is going to change.

I recall being told that some folks in the ecosystem were relying on ipfs dag export happening to result in the particular type of deterministic CARv1 file used in Filecoin today (cc @ribasushi), so it might be that people need to be able to choose between the old and new encoding formats to preserve some of the assumptions that people may be making today.

> One issue might be that CAR is underspecified when it comes to duplicate blocks

I don't see why this is an issue. IIUC the thing that we're describing is effectively a spec for how Filecoin SPs convert DAGs into sectors. It should be possible to specify whatever we want here by either agreeing on a "best" option ahead of time or by supporting multiple options in the deal negotiation/transfer protocols.

> multi-root CAR

Not specific to multi-root CARs, but in general if we change the DAG -> sector mapping we may want to change our recommendations as to what goes into the memo field for deals here depending on the purpose of the memo field. For example, if the purpose of the field is to be able to reproduce DAG -> sector mappings then some information about the algorithm used for the mapping should be included.

nikkolasg · 2022-02-01T13:09:40Z

nikkolasg
Feb 1, 2022

First, I think this would be a great thing to have.
However, it looks to me like we are trying to fix the symptoms rather than the root cause. The reason we have these weird paddings and such is that SPs only have the choice to onboard large discrete units of storage (32GB), so there needs to be efficient packing of client deals. We talked a bit in Madrid about abstracting that notion of "sectors" away (and already did some work around that in the exponential growth effort). That, plus having a more "streamlined" PoRep (if possible; some of the proposals seemed like it), seems to fundamentally solve the problem rather than working around the cumbersome proofs we have now.

Again, I think it's great that Kuba started this FIP, but I just want to mention some of my worries, things we should not lose sight of.

2 replies

Kubuxu Feb 1, 2022
Collaborator Author

I think it is quite orthogonal to this.
Even if we had flexible sector sizes, would we go to the length of allowing non-power-of-two sector sizes? Even if we did, there is still the issue of small deals being too expensive to be performed on chain (the on-chain footprint is independent of the deal size).

nikkolasg Feb 2, 2022

For sure the end game would be to have deals on an L2 kinda chain. And with recursive proof systems (and more granular PoRep; both are "plausible"), we could allow much more granular file sizes. But definitely more work. Again, not saying we shouldn't do it, just saying that we shouldn't lose track of the holy grail version of what we want to achieve.

anorth · 2022-02-02T04:50:13Z

anorth
Feb 2, 2022
Maintainer

Nice work.

What actor hooks would we need for a subdeal client that is an actor to be able to confirm that its data was included in a deal made by an aggregator?

Read a deal's PieceCID
The aggregator's published SubPieceInfos
A method to compute PieceCID from SubPieceInfos (which is a pure function, so can be implemented anywhere)
Anything else?

For the simple case, would it be better in any way for the aggregator to provide inclusion proofs rather than the set of SubPieceInfos? In total, the SubPieceInfos are less data to prove all sub-pieces, but for any individual client, would an inclusion proof be smaller or easier to verify?

Do any of these answers change for the adaptive version?

1 reply

Kubuxu Feb 2, 2022
Collaborator Author

> The aggregator's published SubPieceInfos

If we are talking about on-chain publishing, then the SubPieceInfos can be published by anyone (client, the aggregator, third party).

> Anything else?

I don't think so; you listed everything needed to go from DealID -> PieceCID -> SubPieceID.
Of course, the ability to verify that a given deal is active is assumed.

> In total, the SubPieceInfos are less data to prove all sub-pieces, but for any individual client, would an inclusion proof be smaller or easier to verify?

SubPieceInfos (like PieceInfos) are a form of "aggregated" inclusion proof to the same root. Your observation is correct: for an individual client, an inclusion proof unique to them is more efficient, but it all depends on who publishes it.

If it is the aggregator that publishes on-chain, then it is better to publish SubPieceInfos; if it is a client, then the individual proof is better.

On the other hand, given the SubPieceInfos, one can craft the individual proof. This leads me to think that we should investigate a verification scheme where both are compact and accepted.
In the end, we can expect an actor, which in its state stores SubPieceID, and is expecting this SubPieceID to be included in some deal. This expectation can be resolved by submitting (DealID, proof), where the proof could be an individual inclusion proof or an aggregate version of that proof. The reason for accepting aggregate versions is that they can be stored on-chain once and used multiple times.
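
To illustrate how an individual proof can be crafted from the aggregate one and then checked, here is a minimal Python sketch. It simplifies heavily: all sub-pieces are assumed to have the same size (so the aggregate is just an ordered list of leaf commitments), the node hash is assumed to be SHA-256 with the two top bits of the last byte cleared, and all function names are hypothetical.

```python
import hashlib


def merge(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into a parent (assumed: SHA-256, top two bits cleared)."""
    digest = bytearray(hashlib.sha256(left + right).digest())
    digest[31] &= 0x3F
    return bytes(digest)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """All tree levels, leaves first; the leaf count must be a power of two."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("leaf count must be a power of two")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([merge(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def individual_proof(leaves: list[bytes], index: int) -> list[bytes]:
    """Sibling path for leaves[index], derived from the aggregate list."""
    path = []
    for level in build_levels(leaves)[:-1]:
        path.append(level[index ^ 1])  # sibling sits at the neighbouring index
        index >>= 1
    return path


def verify_inclusion(leaf: bytes, index: int, path: list[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path."""
    node = leaf
    for sibling in path:
        node = merge(node, sibling) if index % 2 == 0 else merge(sibling, node)
        index >>= 1
    return node == root
```

In this sketch an actor holding a SubPieceID would call `verify_inclusion` with the root taken from the deal's PieceCID, while anyone holding the full list can produce the path with `individual_proof`.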

The Simple version is the Adaptive version constrained to M=0. Assuming we are interested in on-chain verification only for sub-deals with M=0, nothing changes (apart from the logic for building an aggregate version of the proof becoming a bit harder). A deal can contain sub-deals with varied M.
