Skip to content
This repository has been archived by the owner on Aug 11, 2021. It is now read-only.

IPLD Format API cleanups #51

Merged
merged 1 commit into from
May 8, 2019
Merged

IPLD Format API cleanups #51

merged 1 commit into from
May 8, 2019

Conversation

vmx
Copy link
Member

@vmx vmx commented Mar 20, 2019

It is now clearer that the deserialzed data needs to have only types
specified by the IPLD Data Model.

BREAKING CHANGE: Switch from callback-based interface to a Promise-based one

The major change is switching to Promises for the function signatures.

Other changes:

  • The multicodec propery is now called format
  • tree() returns an Iterable

/cc @Gozala

@vmx vmx requested a review from rvagg March 20, 2019 17:15
@ghost ghost assigned vmx Mar 20, 2019
@ghost ghost added the in progress label Mar 20, 2019
@vmx vmx mentioned this pull request Mar 20, 2019
5 tasks
Copy link

@Gozala Gozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a better direction than more dramatic changes proposed in an earlier version.

I would just clarify few things so that expectations are absolutely clear to the implemeter:

  • Is return value a promise, or actual value or both ? I think it should be clear and reader should be adviced which option to prefer in which case. e.g. do not return promise unless underlying implementation requires it that would allow IPFS to better optimize throughput.
  • Is error part of API ? When implemeter is expected to error and how ? What happens when error occurs ?

I also find util / resolver seperation confusing & it’s unclear what is the purpose of it ? Maybe this is an opportunity to remove that separation and switch to a flat structure instead

README.md Outdated

`callback` must have the signature `function (err, dagNode)`, where `err` is an Error if the function fails and `dagNode` is the dagNode that got deserialized in the process.
Returns a Promise containing the Javascript object. This object must be able to be serialized with a `serialize()` call.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe go as far as saying await serialize(await deserialize(binaryBlob)) should resolve to bytes equal to binaryBlob

README.md Outdated

> resolves a path in block, returns the value and or a link and the partial missing path. This way the IPLD Resolver can fetch the link and continue to resolve.
> Resolves a path within the blob, returns the value and or a link and the partial missing path. This way the `js-ipld` can fetch the link and continue to resolve.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth giving this structure a name e.g Cursor to ease communication

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used ResolverResult. I don't like it much, but I also didn't like Cursor :) Ideas for a better names are welcome.

@@ -114,25 +125,26 @@ If `path` is the root `/`, the result is a nested object that contains all paths

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example is no longer up to date here

- value: <> - The value resolved or an IPLD link if it was unable to resolve it through.
- remainderPath: <> - The remaining path that was not resolved under block scope.
- `value` (`IPLD Data`): the value resolved, on of those from the [IPLD Data model](https://github.com/ipld/specs/blob/master/IPLD-Data-Model-v1.md)
- remainderPath (`string`): the remaining path that was not resolved under block scope

If `path` is the root `/`, the result is a nested object that contains all paths that `tree()` returns. The values are the same as accessing them directly with the full path. Example:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t it still be {value:..., remainderPath:''} ? Otherwise caller will need to deal with the fact that sometimes it’s cursor and at other times it’s not, and what if node happens to have remainderPath field.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I’m misunderstanding if path is / resulting cursor will have value equal to one returned by deserialize(bytes). If so worth mentioning IMO, if not so also worth pointing that out

README.md Outdated

`IpldNode` is a previously deserialized binary blob.

Returns a Promise containing a `Uint8Array` with the serialized version of the given IPLD Node.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it return Uint8Array without wrapping it in a promise ? If so worth mentioning that.


#### `util.deserialize(binaryBlob, callback)`
The result is a JavaScript object. Its fields are the public API that can be resolved through. It’s up to the format to add convenient methods for manipulating the data.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant to say “returns promise for JS object representing a node...”

I also assume that wrapping into promise isn’t required, however it’s worth communicating


> get the CID of a binary blob
#### `util.cid(binaryBlob[, options])`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can options made non optional ? That would simplify implemeter side of things as of consumption it’s always indirect anyway (as far as I can tell)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that normally you would just call it without any options as you would want to use the default values for the CID version and the hash algorithm.

@vmx
Copy link
Member Author

vmx commented Mar 20, 2019

  • Is return value a promise, or actual value or both ? I think it should be clear and reader should be adviced which option to prefer in which case. e.g. do not return promise unless underlying implementation requires it that would allow IPFS to better optimize throughput.

I thought that if I want to be able to return a Promise, I would always need to (I was actually already concerned about the perf implications). But after a quick test it looks like I can also await a sync fun. But wouldn't this mean that it's always optional to return a Promise (sounds like I need to read more about Promises)?

  • Is error part of API ? When implemeter is expected to error and how ? What happens when error occurs ?

I should indeed add more about that in the doc.

I also find util / resolver seperation confusing & it’s unclear what is the purpose of it ? Maybe this is an opportunity to remove that separation and switch to a flat structure instead

Me too. Though I see this as a first iteration we can ship quickly. For example deserialize() and resolve('/') is indeed the same. So deserialize() could be removed. But again, I didn't want to make too many changes.

@Gozala
Copy link

Gozala commented Mar 20, 2019

I thought that if I want to be able to return a Promise, I would always need to (I was actually already concerned about the perf implications). But after a quick test it looks like I can also await a sync fun. But wouldn't this mean that it's always optional to return a Promise (sounds like I need to read more about Promises)?

Yes you can await on any value, promise or not (in fact if it has .then(suceed, fail) it will be invoked and resumed with whatever’s passed to succeed or thow whatever’s passed to fail).

However if I’m not mistaken await will still wait for a tick, so if you want to avoid that consumption will need to be a bit more complex, but possible.

@Gozala
Copy link

Gozala commented Mar 20, 2019

Me too. Though I see this as a first iteration we can ship quickly. For example deserialize() and resolve('/') is indeed the same. So deserialize() could be removed. But again, I didn't want to make too many changes.

I would still leave deserialize and possibly make resolve be an optional (optimization route).

Also resolve needs to wrap result into cursor so it’s not exactly the same

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, help me out here because I'm seeing a bigger breaking change than just async: the current version uses a format-specific "dagNode", which is returned by deserialize(blob) and required by serialize(blob). You then use resolve() on that "dagNode" to get at properties of it and even instantiate a full JavaScript object representing the data. I believe that's right isn't it? js-ipld-dag-pb dedicates a heap of code to this.

This new implementation seems to do away with that and talks about an "IPLD Node", which should take the shape of the deserialized data itself? (Object.keys(thing) gives you keys of the underlying data, presumably giving you the ability to use getters on it to lazy decode? but not asynchronously then). "add convenient methods for manipulating the data" suggests it can be decorated with additional methods? Maybe they can't be enumerable but they also couldn't clash with possible key values of the data, which leads to the toString / valueOf problem. "deserialize() and resolve('/') is indeed the same" suggests that the object returned by deserialize() is the actual data because isn't resolve() responsible for decoding the node into a proper JavaScript object?

I think I like the original version, or at least my reading of it, just async deserialize(), it returns some format-specific object, serialize() knows how to deal with only that type of object, resolve() lets you get at the underlying data in a predictable and uniform way across formats and can even be used to take the format-specific object and instantiate a plain old JS object from it. Async resolve() solves all of the the laziness problems raised in #50.

In line with this, the Definitions section needs to be edited or removed, it's referring to "dagNode" still while the rest has moved on to this new "IPLD Node" and IpldNode language.

README.md Outdated

`callback` must have the signature `function (err, binaryBlob)`, where `err` is an Error is the function fails and `binaryBlob` is a Buffer containing the serialized version.
> Deserialize into internal representation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*"into the internal representation of an IPLD Node"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually completely wrong. Correct is: "Deserialize a binary blob into an IPLD Node."

README.md Outdated

- value: <> - The value resolved or an IPLD link if it was unable to resolve it through.
- remainderPath: <> - The remaining path that was not resolved under block scope.
- `value` (`IPLD Data`): the value resolved, on of those from the [IPLD Data model](https://github.com/ipld/specs/blob/master/IPLD-Data-Model-v1.md)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the value resolved, on of those from the" -> "the value resolved, whose type is one of the"

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@Gozala
Copy link

Gozala commented Mar 21, 2019

So, help me out here because I'm seeing a bigger breaking change than just async: the current version uses a format-specific "dagNode", which is returned by deserialize(blob)

@rvagg Not sure what you're getting at here, deserialize does exactly the same thing as it used to do before, except in the past it took callback and passed node to it and now it returns a promise that resolves to that node instead.

and required by serialize(blob).

You mean serialize(node) I presume, which again does what it used to except instead of callback mechanism it uses a promise.

You then use resolve() on that "dagNode" to get at properties of it and even instantiate a full JavaScript object representing the data. I believe that's right isn't it?

I think you're misunderstanding something here. resolve(blob, path) also does what it used to except API is promise based now. Please note that resolve doesn't take dagNode it takes serialization of one so blob and a path and decodes only what corresponds to the path.

In other words decode(blob) decodes the whole think while resolve(blob, '/x') decodes just whatever /x corresponds to. Internally it could still do const node = await decode(blob); return node.x but it also has an opportunity (where format allows) to avoid decoding irrelevant bits.

That is also why I'm suggesting to make resolve optional, that way formats that can't do partial decoding or don't want to bother doing it will be free of that burden, default resolve could just decode the whole thing and extract value that corresponds to the given path.

I hope this clarifies it.

js-ipld-dag-pb dedicates a heap of code to this.

Not familiar with that code but hopefully above explains it.

This new implementation seems to do away with that and talks about an "IPLD Node", which should take the shape of the deserialized data itself? (Object.keys(thing) gives you keys of the underlying data, presumably giving you the ability to use getters on it to lazy decode? but not asynchronously then). "add convenient methods for manipulating the data" suggests it can be decorated with additional methods?

I don't believe it has anything to do with lazy decoding. I would assume it's about convenience methods like toJSON() or some computed fields. Personally I'm in favor of keeping data plain and exposing behavior as functions but I can see why some might choose to hang those on prototype chain instead.

Maybe they can't be enumerable but they also couldn't clash with possible key values of the data, which leads to the toString / valueOf problem. "deserialize() and resolve('/') is indeed the same" suggests that the object returned by deserialize() is the actual data because isn't resolve() responsible for decoding the node into a proper JavaScript object?

My understanding is that serialize(node) turns node into a blob that can be turned back to node via deserialize(blob) and what node is left up to the format. Some may choose to represent that with class instances that have convenience methods and others may choose to use plain data instead.

I think you're misunderstanding resolve it's not about traversing de-serialized node, but rather about desalinization of only relevant pieces of the node.

As of comment about Object.keys(node), I think all it's trying to suggest that iterating fields of the node should only return data (and not the convenience stuff you've added - add that to the prototype).

P.S. Nave collision isn't a big deal if all the methods are in the prototype (unless user attempts to call that method), but since node interface is designed by the same author as a format likely collisions can be avoided.

I think I like the original version, or at least my reading of it, just async deserialize(), it returns some format-specific object, serialize() knows how to deal with only that type of object, resolve() lets you get at the underlying data in a predictable and uniform way across formats and can even be used to take the format-specific object and instantiate a plain old JS object from it. Async resolve() solves all of the the laziness problems raised in #50.

In line with this, the Definitions section needs to be edited or removed, it's referring to "dagNode" still while the rest has moved on to this new "IPLD Node" and IpldNode language.

This however makes me wonder if decode should just return a plain data (no fancy prototypes) as otherwise users are assumed to know what the underlying format is which seems to contradict of the IPLD vision of common interface across arbitrary formats.

@vmx
Copy link
Member Author

vmx commented Mar 25, 2019

I hope I've addressed all the review comments with the new commit.

Things I haven't addressed:

  • Is error part of API ? When implemeter is expected to error and how ? What happens when error occurs ?

I agree that this is very important and needs to be part of this spec. Though as it is not part yet, I'd like to wait until we agreed on the API (in hope to save some time).

I would assume it's about convenience methods like toJSON() or some computed fields. Personally I'm in favor of keeping data plain and exposing behavior as functions but I can see why some might choose to hang those on prototype chain instead.

This was indeed the idea. I've put it in to make it clear that you are allowed to do that. I didn't want to be too strict to begin with as it might be useful in some cases.

I think I like the original version, or at least my reading of it, just async deserialize(), it returns some format-specific object, serialize() knows how to deal with only that type of object, resolve() lets you get at the underlying data in a predictable and uniform way across formats and can even be used to take the format-specific object and instantiate a plain old JS object from it. Async resolve() solves all of the the laziness problems raised in #50.

That is 100% correct. But that's something I'd like to change in order to make IPLD Formats nicer to use.

@vmx
Copy link
Member Author

vmx commented Apr 1, 2019

@rvagg: Do your concerns still hold try. I start to implement those changes (and see how this goes).

@rvagg
Copy link
Member

rvagg commented Apr 2, 2019

@vmx I don't think I'm in a position where I can hold firm opinions about this stuff, so don't treat my comments as blockers. I'll be interested to see it in code form.

@rvagg
Copy link
Member

rvagg commented Apr 30, 2019

@mikeal and @Gozala: we had a discussion over at js-ipld-dag-pb about what's sync and what's async and in the the latest iteration of the refactoring @vmx has eradicated a the vast majority of async and faux-async operations across the current formats, particularly dag-pb which has only a single point where it does an async operations: generating a CID from an existing node. So the current iteration of this doc reflects that and is mostly-sync because the current formats are mostly-sync (and historically have been implemented faux-async with generous helpings of setImmediate()).

That's going to be a problem for anything that inserts crypto into the flow, but I think the current position is something like: there's no current implementation that inserts such needs and the performance problems of the faux-async approach has already been sufficiently highlighted so let's reserve async for only cases where it's actually necessary. Then js-ipld-stack can be the forum for getting it right with encryption as a first-class concern. That's where we either need to explore MaybePromise interfaces to have optionally-async methods and not insert a severe performance penalty where it's not necessary, or just bite the bullet and make operations async wherever they'll actually need crypto work.

@mikeal
Copy link
Contributor

mikeal commented Apr 30, 2019

Since the code is already written in js-ipld-stack, I don’t know mind having optional async in the encode/decode interface even if we aren’t using it yet. However, my take on encryption is still that it belongs in layers above the block encode/decode layer — I think it’s a better approach to build this above the data model layer so that encrypted blocks can self-describe in data you can read knowing only the codec.

If we try to do encryption at the block layer we’ll end up with a multicodec for every encryption technique and we’d need to build encrypted dag codecs in order to replicate them. None of that sounds ideal to me.

@Gozala
Copy link

Gozala commented Apr 30, 2019

we’d need to build encrypted dag codecs in order to replicate them.

I think you'll need that even if it's a separate layer because links from encrypted blocks should be hidden for entities that don't have permission (ability to decrypt that).

That is not to say it should be necessarily be done at the block layer, rather to say that if encrypted blocks were to become first class all of the replication / selection stack will need to have design constraint reflecting that sometimes Dag traversal would need to deal with decryption.

@Gozala
Copy link

Gozala commented Apr 30, 2019

@mikeal and @Gozala: we had a discussion over at js-ipld-dag-pb about what's sync and what's async and in the the latest iteration of the refactoring @vmx has eradicated a the vast majority of async and faux-async operations across the current formats, particularly dag-pb which has only a single point where it does an async operations: generating a CID from an existing node. So the current iteration of this doc reflects that and is mostly-sync because the current formats are mostly-sync (and historically have been implemented faux-async with generous helpings of setImmediate()).

Few notes:

  1. I do think removing artificial asynchronicity through setImmediate makes sense. IMO if anyone's it should be a consumer's choice and not a producers.
  2. As of making interface strictly sync I think that would be a mistake. Not because of encryption / decryption, but rather because it forces that choice onto all implementations and as I understand it primarily to remove artificial overhead. Which as I pointed it some other issue could be addressed without forcing sync choice:
     Task.spawn(function*() {
        // Task.spawn resumes generator right away if yielded value isn't a promise & awaits
        // otherwise.
        const node = yield codec.decode(bytes)
     })
    To be absolutely clear: It could be that there are other legit reasons to choose sync interface, however in this instance it just seems like swinging from one extreme to the other.

@vmx
Copy link
Member Author

vmx commented Apr 30, 2019

@Gozala We are indeed swinging into the other extreme directing with making almost everything sync. The current API doesn't work that well for numerous reasons, hence we do experiments over at https://github.com/ipld/js-ipld-stack. Util we get there I think it makes sense changing the current API to async/await and to the bare minimum what we need now, and put all other things we want/need in the future into ipld-stack.

@mikeal
Copy link
Contributor

mikeal commented May 1, 2019

As of making interface strictly sync I think that would be a mistake

Ya, you’ll notice that js-ipld-stack still allows for async and sync encoding methods. I don’t have a strong opinion about this iteration of interface-ipld-format because it matches what we’ve been able to update all the current implementations to and if we have future async needs I’m skeptical they will be implemented before we migrate to js-ipld-stack.

It is now clearer that the deserialzed data needs to have only types
specified by the IPLD Data Model.

BREAKING CHANGE: Switch from callback-based interface to a Promise-based one

The major change is switching to Promises for the function signatures.

Other changes:

 - The `multicodec` propery is now called `codec`
 - `tree()` returns an Iterable
 - All methods except for `cid()` are now synchronous
@vmx
Copy link
Member Author

vmx commented May 8, 2019

Given that @rvagg approved all the PRs of the implementations of this interface, I consider this PR approved by him. Hence merging it now. @rvagg if you don't agree or anything is missing, please open an issue.

@vmx vmx merged commit d7ae0f9 into master May 8, 2019
@vmx vmx deleted the api-review-v2 branch May 8, 2019 13:21
@ghost ghost removed the in progress label May 8, 2019
@vmx vmx mentioned this pull request May 8, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants