feat: blob protocol draft #115

Gozala · 2024-03-19T06:13:05Z

Based on #114. PR is written with UCAN 1.0 format and assuming storacha/RFC#12 however in terms of immediate implementation I suggest that we instead

Use blob/add instead of /space/content/add/blob
Use web3.storage/blob/* in place of /service/blob/

I suggest above because I think it would make more sense to transition to storacha/RFC#12 once we have [email protected] implemented, because I suspect links to tasks vs invocations are going to be a pain to deal with otherwise. This will give us cleaner break.

In terms of implementing /service/blob/accept and specifically how does client signal that they've completed upload I suggest that we do whichever is easiest option from following two:

Make client sent second blob/add invocation after they've done upload so we can perform a lookup.
Add another temp capability with empty output either very specific like blob/add/poll or very general like invocation/wake { task: CID } .

vasco-santos

Thanks for creating this ❤️ we are generally quite close on it

w3-blob.md

vasco-santos · 2024-03-19T13:39:45Z

w3-blob.md

+
+###### Add Blob Committed
+
+Capability provider MUST issue receipt with `LocationClaim` capability delegated to the [space] when requested blob is added to the space, which can happen if provider already has a matching blob locally.


I am confused here. This should only happen when bytes make its way to the bucket right? (based also on the diagram you previously shared)

I think this is poorly worded, I'll fix it but what I meant to say that if provider already has such blob they can issue "committed" case of the ok result as opposed to "allocated".

You are right that if provider does not already have a blob we are not going to issue LocationClaim until we know we have it.

w3-blob.md

hannahhoward

Is the intent for blob/* to eventually replace store/*?

Do we have the work to implement this tracked anywhere? (seems like it's not trivial amount of work -- should probably ship with spec)

hannahhoward

Think we need agreement at minimum on prioritizing blob/* work in client and server before we merge this spec. Otherwise it's a set of actions that aren't implemented

vasco-santos · 2024-03-19T15:15:31Z

@hannahhoward yes! This is our way forward with write to anywhere to avoid the migration pains. But we need to include in the write to anywhere ticket

Gozala · 2024-03-19T16:27:32Z

Is the intent for blob/* to eventually replace store/*?

Do we have the work to implement this tracked anywhere?

I do think we want to some day, but there is no urgency to do so. We do want to introduce new blob capabilities here as they do not imply bucket events and we want to migrate clients to that.

(seems like it's not trivial amount of work -- should probably ship with spec)

Generally we have been aligning on spec first and only then going about implementing it, sometimes that imply changes to spec but alignment on how seemed to be a good way to avoid changing code back and forth.

Gozala

comment

w3-blob.md

hannahhoward · 2024-03-19T17:08:31Z

So blob/* is the new write to anywhere api?

This needs clarification in the PR description

w3-blob.md

hannahhoward

For now, I think we should simply make the audience of the location claim public. The current flow implies a redelegation of authority every time the location changes.

I'm concerned we're thinking we can get private data for free here without thinking through the implications. Private retrieval is a huge topic -- I think we should design for it intentionally, rather than throw in features that may create lots of headaches on a theory it will be sufficient for private data. I think privacy should be set at the ingress point (i.e. by the space owner on the blob itself, absent the location, either during blob/add or through some seperate update apit), not on the location claim itself which is temporary.

Either way, I think we should move it out of our current design, because this requires more intention.

hannahhoward · 2024-03-19T17:32:03Z

Realizing removing the redelegation requires invoking IPNI/offer and/or store/publish as seperate capabilities. Again, this is I think the right approach.

If a location is managed by the provider, it must be responsibile for keeping location claims, wherever they reside, updated.

Gozala · 2024-03-19T17:52:38Z

If a location is managed by the provider, it must be responsibile for keeping location claims, wherever they reside, updated.

What we settled on in the storacha/RFC#13 that our issued location claims will have HTTP URLs with hints of where content is. Those URLs will simply route / redirect to the location content is in the system.

This enables us to make long term location commitments while retaining ability to change actual site of the content over time.

Does this address the concern you're raising ?

Gozala · 2024-03-19T18:05:03Z

For now, I think we should simply make the audience of the location claim public. The current flow implies a redelegation of authority every time the location changes.

I think this is inaccurate, as mentioned in prior comment per storacha/RFC#13 we intend to make issued location changes agnostic of content site changes.

I'm concerned we're thinking we can get private data for free here without thinking through the implications. Private retrieval is a huge topic -- I think we should design for it intentionally, rather than throw in features that may create lots of headaches on a theory it will be sufficient for private data. I think privacy should be set at the ingress point (i.e. by the space owner on the blob itself, absent the location, either during blob/add or through some seperate update apit), not on the location claim itself which is temporary.

I think there is a misunderstanding here. I also want to assure you that I have been thinking about private retrievals for more than a year now it's just there were always higher priorities at play.

I'd like to step back from privacy here for moment. What is proposed here is not to assume that adding a blob implies you want it to be indexed, advertised and served across all possible channels. Instead we can issue a commitment to the space itself that we will serve the blob from the given URL through the commitment validity window. This commitment can be published by user if they want to make that data available publicly. This also creates an opportunity to create a system where our commitments can be verified and we can be held accountable if we fail to uphold them.

In other words it gives user a choice to do what they want with our commitment to serve the blob.

Co-authored-by: Vasco Santos <[email protected]>

…/store-api-revamp

…/location-claims

hannahhoward · 2024-03-19T19:43:07Z

Thinking back to storacha/RFC#13 I now see this is a bit of a misnomer as a solution.

Yes, it's a public URL that will continue to work, if you go through w3s.link.

It still has an expiration and the data still could move. Presumably if I have https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key and then it gets moved, the service would want to publish at minimum a second claim with a better hint -- i.e. https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region2/bucketName2/key2-- even if the first continued to be valid. But how does the client know to re-delegate this new claim in order to publish it? Or if the claim expires, and the publisher wants the new claim (potentially still at the same location) re-published, are they in charge of monitoring that?

Also, unless I misunderstand, we are trying to get to a public mechanism whereby a client can find the location of the car and its index, and do range reads. How would it do so if the hints become out of date or the claim expires? Would it talk directly to the w3s content claims service? How would it know to do so (assuming it wasn't a web3.storage exclusive retrieval client)? Or would it be forced to fetch from w3s.link instead at that point?

Ultimately, I think we're dealing with two different chains of authority we're trying to smash into one here:

Authority to manage the visibility / discoverability of a piece of data
Authority to move that data around and update its location

I think we should probably take a step back and sync on all this. I feel that these are valid concerns, not just nitpicking, especially as we look forward to the architecture we're building.

My proposal is to hold on this till we can sync during check-in on Thursday (in the interest of not adding meetings, use what's scheduled)

Note: I'd prefer not to mess up @vasco-santos ability to implement, so I'm ok if he starts working in the meantime. But I think we're not in alignment completely yet.

hannahhoward

Removing objections after synchronous discussion. Will document reasoning shortly.

hannahhoward · 2024-03-20T02:47:13Z

Documentation about why I removed my objections:

A public location is in fact a commitment to make data available at a given URL for the length of the claim. Therefore, it makes sense that it's ok to let the user choose to then share the claim with the wider audience
docs: datawherehouse location claim + store/publish RFC#13 is intended to enable this - in the sense that it enables the provider to make a claim that data is available at a given URL, and maintain the URL in a context where it moves around internally
The same PR is not a solution for the point where we have a network of providers. But that's ok -- we can explore other solution when we get there, either by having those providers publish claims themselves, or by using w3s as a router in claims. There is probably more design work to do here.

Also: I can see a world where the right solution is to have the location claims be issued by provider, and then retrieved when you do blob/get. I wonder if we should treat "repair" -- where a provider fails to satisfy their claims -- as a seperate service.

vasco-santos · 2024-03-22T10:35:31Z

w3-blob.md

+type Blob = {
+  content: Multihash
+  size:    int
+}


We should allow user specify an expiry here so we can reflect it on our end with accept.

I think you may have misunderstood my intention behind exp in accept, there it meant to capture pre-signed url expiry. While here I would expect it to imply that blob should be GC-ed at given time.

Am I interpreting your intentions correctly here ? If so I'd say lets not add this until we actually are able to support GC-ing blobs, when we do we can introduce this field.

yes, I was considering this to be the expiration we talked about for expiration of claim #115 (comment)

but than can come as follow up

w3-blob.md

vasco-santos · 2024-03-22T13:31:42Z

w3-blob.md

+
+### Allocation Effects
+
+Allocation MUST have no effects.


I was thinking this would actually point to Accept as well, given next step is that after allocation. Why not?

Perhaps 🤔, but here here is my sentiment

Can I allocate space without waiting for it to be received ? If so, this is a function to do and it should not have a continuation there. If no, I think I need bit more convincing because here are my counter arguments

Overall I think you could roughly translate blob/add as following JS code

const blobAdd = async (task) => { const { sub, args: { blob, exp } } = task // check that we have billing setup for the space const { ok: subscription, error: cantbill } = await service.getSubscription(sub) if (cantbill) { return { error: cantbill } } // allocate memory for the blob const { ok: memory, error: allocError } = await service.allocate(sub, { exp, blob, cause: task.link() }) if (allocError) { return { error: allocError } } // add usage to the billing if (memory.size > 0) { await subscription.bill.add({ storage: size, cause: task.link() }) } // if we have memory address we need to await for the content if (memory.address) { return { ok: await accept({ blob, exp }) } } else { return { ok: findReceiptFor({ accept, blob, exp }) } } } const accept = async ({ blob, exp }) => { const { ok: content, error: timeout } = await waitForPUT({ blob }) if (timeout) { return { error: timeout } } // validate content matches the blob const { error } = await vaildate(content, blob) if (error) { return { error } } return { ok: createLocationClaim(blob) }) }

And through above lens, addBlob is a task that coordinates subtasks, but subtasks allocate and accept are independent of each other.

If we made accept effect of the allocate, then we probably should not make it effect of the blobAdd also. However making it effect of the blobAdd is better because we can then draw the whole execution graph from the receipt without having to follow each effect.

For what it's worth these observations and opinions are new to me also and I have call with Fission on Monday to try and get their input on this as they spend a lot more time doing and implementing this.

thanks for the details, I find this a good reasoning. My suggestion was more about having a receipt for a blob/allocate and not really linking it to anything. But we have as part of the task arguments the blob/add who caused this, so I think that also mitigates what I was looking for

vasco-santos

There are a few details in comment, as well as remove/list to add here. But I am happy to cover them in follow up :)

w3-blob.md

.github/workflows/words-to-ignore.txt

w3-blob.md

vasco-santos

LGTM

Gozala added 2 commits March 18, 2024 21:26

feat: revamp store & upload specs

1ca5aa5

draft: blob spec

6570185

Gozala requested a review from vasco-santos March 19, 2024 06:13

vasco-santos requested changes Mar 19, 2024

View reviewed changes

hannahhoward reviewed Mar 19, 2024

View reviewed changes

hannahhoward requested changes Mar 19, 2024

View reviewed changes

Gozala added 3 commits March 19, 2024 09:22

fix: ignore some words

9544c3e

fix: grammar

7f9c548

fix: ignore some words

0ab6911

Gozala added 2 commits March 19, 2024 09:30

chore: ignore more words

90e9607

chore: ignore more words

4f94fa9

Gozala commented Mar 19, 2024

View reviewed changes

hannahhoward reviewed Mar 19, 2024

View reviewed changes

w3-blob.md Outdated Show resolved Hide resolved

w3-blob.md Outdated Show resolved Hide resolved

Gozala commented Mar 19, 2024

View reviewed changes

w3-blob.md Outdated Show resolved Hide resolved

hannahhoward requested changes Mar 19, 2024

View reviewed changes

Gozala and others added 9 commits March 19, 2024 11:14

Apply suggestions from code review

6ede80c

Co-authored-by: Vasco Santos <[email protected]>

fix: spelling errors

192c183

Merge remote-tracking branch 'origin/feat/store-api-revamp' into feat…

7b7bdc3

…/store-api-revamp

fix: broken link

6784ecd

fix: md lint issues

e95aab2

chore: remove double blank

5751177

fix: capture allocation fields

727d905

fix: lint

0140d3e

Merge remote-tracking branch 'origin/feat/store-api-revamp' into feat…

86705a2

…/location-claims

fix: broken link

b292df3

hannahhoward approved these changes Mar 19, 2024

View reviewed changes

Base automatically changed from feat/store-api-revamp to main March 22, 2024 04:46

chore: second draft

6a60a0a

vasco-santos reviewed Mar 22, 2024

View reviewed changes

w3-blob.md Show resolved Hide resolved

vasco-santos reviewed Mar 22, 2024

View reviewed changes

vasco-santos approved these changes Mar 22, 2024

View reviewed changes

vasco-santos reviewed Mar 22, 2024

View reviewed changes

w3-blob.md Show resolved Hide resolved