This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Making eth an application rather than a format #29

Closed
nicola opened this issue Oct 3, 2016 · 12 comments
Labels
status/deferred Conscious decision to pause or backlog

Comments

@nicola
Member

nicola commented Oct 3, 2016

This is the argument for treating eth-blocks not as a format with a multicodec, but as a namespace and an application.

Note: Personally, I am fine either way! I just want to make sure we make the right choice.

I tend to think that namespaces work best in this particular case, but of course, if we decide that the destiny of CID is to abstract over this, so that everything is an IPLD object, I am fine with that too (but the case for unixfs that I make below is still valid).

This answers #27 (and https://github.com/unixfs/notes/issues/173) in a different way than the current proposal: application vs data format

Current state: Eth-block as a data format

Eth-block as a data format for IPLD:

  • eth-blocks will resolve in /ipld
  • eth-blocks will need to reserve a multicodec number that will be prefixed to their hash
  • eth-blocks will need ipld-parser-eth

Process to transform eth-block into IPLD:

  • read eth-block hash
  • spot multicodec
  • decode binary into IPLD
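As a rough illustration of the three steps above, here is a minimal Python sketch of codec-based dispatch. Everything in it is an assumption made for illustration: the one-byte prefix values, the parser stubs, and the `decode` helper are hypothetical and are not taken from any IPLD implementation.

```python
from typing import Callable, Dict

# Hypothetical one-byte codec prefixes, chosen only for this sketch.
CODEC_CBOR = 0x01
CODEC_ETH_BLOCK = 0x02


def parse_cbor_stub(block: bytes) -> dict:
    # Stand-in for a real CBOR decoder.
    return {"format": "cbor", "size": len(block)}


def parse_eth_block_stub(block: bytes) -> dict:
    # Stand-in for something like ipld-parser-eth; a real parser would
    # decode the block's fields instead of reporting its size.
    return {"format": "eth-block", "size": len(block)}


# Registry: codec prefix -> parser for that data format.
PARSERS: Dict[int, Callable[[bytes], dict]] = {
    CODEC_CBOR: parse_cbor_stub,
    CODEC_ETH_BLOCK: parse_eth_block_stub,
}


def decode(address: bytes, block: bytes) -> dict:
    """Read the prefixed hash, spot the codec, decode the binary block."""
    if not address:
        raise ValueError("empty address")
    codec = address[0]  # the prefix in front of the hash
    parser = PARSERS.get(codec)
    if parser is None:
        raise ValueError(f"no parser registered for codec {codec:#x}")
    return parser(block)
```

In this model every new format means one more entry in `PARSERS`, which is the growth the next paragraph worries about.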

By making eth-block a data format, we are overspecializing a format to work with only one particular application. What if there are 100 new cryptocurrencies? Will we create a new format for each?

Proposed: Eth-block as an application

Parallelism with unixfs

Let me start with a parallel to unixfs on IPLD.

tl;dr:

  • unixfs takes IPLD objects and turns them into IPFS blocks (= IPLD binary)
  • ethfs takes Eth blocks (IPLD binary) and turns them into IPLD objects

unixfs as a data format

Say that we treat unixfs as a data format for IPLD, then:

  • unixfs would resolve under /ipld
  • unixfs will need to reserve a multicodec number that will be prefixed..
  • unixfs needs an ipld-parser-unixfs

Process to transform unixfs into IPLD

  • read hash, spot multicodec, transform IPLD objects into IPLD binary (= unixfs objects)

    unixfs multicodec = 0xIP
    
    // /ipld/HASH
    // say that sharding was actually done this way
    {
    shard1:
    {blocks: [{ '/': h1}, { '/': h2}, { '/': h3}]},
    shard2: ...
    }
    
    // /ipld/0xIPHASH
    Hello how is it going this is my long content...
    

unixfs as an application

Instead, for simplicity, we made unixfs an application on top of IPLD, not a data format.

unixfs as an application:

  • will need a namespace /unixfs
  • /unixfs will transform IPLD objects into IPLD binary (= unixfs objects) as shown before
  • /unixfs will serve IPLD binary

Process to transform IPLD to unixfs

  • read path, spot namespace, find application mapping namespace, transform IPLD objects into IPLD binary (= unixfs objects)

Eth-block as an application

Eth-block as an application:

  • will need /eth namespace
  • /eth will transform IPLD binary (which is Eth binary block) into IPLD objects
  • /eth will serve IPLD objects (traversable & so on)

Process to transform Eth-block (= IPLD binary) into IPLD object

  • read path, spot namespace, find eth application, use eth application to transform a binary into an IPLD object
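For comparison with the multicodec route, the namespace route can be sketched in Python as follows. The block store, the `eth_app` stub, and the `resolve` helper are all hypothetical names used only for this illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical in-memory block store: hash string -> raw binary block.
BLOCKS: Dict[str, bytes] = {"HASH1": b"\x01\x02\x03"}


def eth_app(raw: bytes) -> dict:
    # Stand-in for an eth application that turns a binary block into a
    # structured object; a real one would decode the block's fields.
    return {"type": "eth-block", "size": len(raw)}


# Registry: namespace -> application that handles paths under it.
APPS: Dict[str, Callable[[bytes], dict]] = {"eth": eth_app}


def split_path(path: str) -> Tuple[str, str]:
    """Split '/namespace/hash' into its two components."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected /<namespace>/<hash>, got {path!r}")
    return parts[0], parts[1]


def resolve(path: str) -> dict:
    """Read the path, spot the namespace, let that application transform the block."""
    namespace, block_hash = split_path(path)
    app = APPS.get(namespace)
    if app is None:
        raise KeyError(f"no application registered for /{namespace}")
    return app(BLOCKS[block_hash])
```

Here the dispatch key lives in the path rather than in the hash prefix; the transformation step itself is the same in both sketches.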

End of the story

At the end of the day, if you look at the process, it is essentially the same

Differences

  • addressing: multicodec vs namespace?
  • make binary into IPLD: parser level vs application level?

Other questions

  • Isn't Eth-block too application-specific to be a data format?
  • Are we going to create data formats for every cryptocurrency and every future application-specific format?
  • Can a CID point to different application-specific objects?
  • Does multicodec cover application-specific formats beyond data formats?

cc @diasdavid @jbenet @dignifiedquire @Stebalien

@daviddias
Member

By making eth-block a data format, we are overspecializing a format to work with only one particular application. What if there are 100 new cryptocurrencies? Will we create a new format for each?

This is incorrect. We don't need to make it format specific, because a valid IPLD format has to offer a way for the IPLD Resolver to resolve through that format (aka partial resolver or block scope resolver) and that is where interface-ipld-format comes in. See: ipld/js-ipld#60

Essentially, we just need a .resolve and a .tree function to make any data format work.
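To make the `.resolve` / `.tree` idea concrete, here is a toy Python sketch that uses JSON as the block encoding. It only mirrors the idea loosely: the class name, the signatures, and the return shapes are assumptions for illustration and are not the actual interface-ipld-format API.

```python
import json
from typing import Any, List, Tuple


def _is_link(node: Any) -> bool:
    # In this sketch a link is a dict whose only key is "/".
    return isinstance(node, dict) and set(node) == {"/"}


class JsonFormat:
    """Toy format exposing a block-scoped resolve() and tree()."""

    @staticmethod
    def resolve(blob: bytes, path: str) -> Tuple[Any, str]:
        """Walk `path` inside one block and return (value, remainder_path).

        Traversal stops at a link so the caller can continue in the
        next block with the remainder of the path.
        """
        node = json.loads(blob)
        parts = [p for p in path.split("/") if p]
        for i, part in enumerate(parts):
            if _is_link(node):
                return node, "/".join(parts[i:])
            node = node[int(part)] if isinstance(node, list) else node[part]
        return node, ""

    @staticmethod
    def tree(blob: bytes) -> List[str]:
        """List every path that can be resolved inside this block."""
        paths: List[str] = []

        def walk(node: Any, prefix: str) -> None:
            if isinstance(node, list):
                items = [(str(i), v) for i, v in enumerate(node)]
            elif isinstance(node, dict) and not _is_link(node):
                items = list(node.items())
            else:
                return  # scalars and links are leaves within this block
            for key, value in items:
                here = f"{prefix}/{key}" if prefix else key
                paths.append(here)
                walk(value, here)

        walk(json.loads(blob), "")
        return paths
```

A format for eth-blocks would implement the same two operations over Ethereum's own binary encoding, so the resolver would not need to know anything else about it.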

IPFS as a data format

Say that we treat IPFS as a data format for IPLD, then:

ipfs would resolve under /ipld
ipfs will need to reserve a multicodec number that will be prefixed..

Again, not 100% correct.

IPFS files will resolve under /ipfs - This is an application on top of IPLD, the unixfs application.

IPFS files don't need to reserve a multicodec, because multicodecs are for IPLD formats. unixfs uses an IPLD format (currently dag-pb).

ipfs needs a ipld-parser-ipfs

Yes, that is called the unixfs-engine, but it is not an IPLD Format, it is a usage of underlying IPLD formats :)

@nicola
Member Author

nicola commented Oct 3, 2016

Sorry, s/IPFS/unixfs. The argument is still the same:
in the same way we built unixfs as an application and not as a format,
maybe we should build eth as an application and not as a format.

Think of it as "ethfs": instead of serving binary (IPLD binary built from IPLD objects), it serves IPLD objects (from IPLD binary, which is Eth binary).

Again, my argument is not really "let's make it this way", but an attempt to understand why this way is not better than the current one. I fear thousands of new data formats will pop up that are not really data formats but application-specific formats.

@daviddias
Member

The key difference is that Ethereum already exists, it has its own Merkle Data Structure, it doesn't ride the IPLD objects like unixfs does.

@nicola
Member Author

nicola commented Oct 3, 2016

Answer to @diasdavid:

Ethfs will just be a way to encode eth binary data into IPLD (not Ethereum).

If we have IPLD-binary, then eth objects will be IPLD by default, then you just need to pass them through /eth to make them (structured) IPLD objects.

The parallel is more or less the following:

  • unixfs takes IPLD objects and turns them into IPFS blocks (= IPLD binary)
  • ethfs takes Eth blocks (IPLD binary) and turns them into IPLD objects

This is instead of making eth-block a data format.

/eth/HASHofBlock is much easier and more intuitive than /eth/CIDPREFIX+HashOfBlock

(Unless what you mean by data formats is what I call transformations, which have a .resolve function.)

@Stebalien
Contributor

So, I may be missing some motivation here but this is how I see it (I apologize if this is a bit disorganized or I'm missing some significant motivation).

Why CIDs

  1. Efficient storage/lookup for restricted formats.
  2. Easy format migration (IPLD+CBOR v1 -> IPLD+CBOR v2, etc.).
  3. Allows naming across existing systems like DAT (systems that support name resolution).

Format or Application Level

The key difference is that Ethereum already exists, it has its own Merkle Data Structure, it doesn't ride the IPLD objects like unixfs does.

This is really a more general problem than just eth; it also applies to applications like git, etc. The trade-off is:

  1. Application Level: That is, import data from eth/git/etc into some IPLD blobstore (converting it to, e.g., IPLD+CBOR).
  2. Format Level: Have all IPLD implementations understand these formats and import them as binary blobs.

Why Application Level

  1. Fewer underlying formats. This makes it easier for implementors.
  2. Fewer ways to name the same logical object. Makes it easier to compare two objects and reason about a set of objects.
  3. Less likely to accidentally load an object stored in one format, modify it, and then save it in another format.

Why Format Level

Can name objects across datastores. That is, one can name an arbitrary Eth object in an IPLD object without importing it into an IPLD datastore because it's already an IPLD object.

IMO, it only makes sense to do this when there is a way to resolve such names. That is, if you would have to download and import an entire dataset anyways (to hunt through it to find your data), you might as well just import it all into your blobstore and convert it to IPLD+CBOR along the way (or some other IPLD format).

For example, I think git should be implemented at the application level because there's no way to resolve a git hash into a git object (unless you have a copy of the git repo). When working with a git repo, users should just import the entire repo into IPLD.

Application Architecture Sketch

I'd like to suggest a general application architecture that works on top of IPLD.

Import/Export

Applications built on top of IPLD must support a way to quickly import/export objects to/from an IPLD service. This would allow one to import git/eth/etc. objects into IPLD, work with them, and then export them back out (if necessary). This is basically how IPFS's tar, unixfs, etc. support works today.

Naming/Indexing

Applications need to support mapping IPLD names to/from names used outside of IPLD (e.g., CID <-> git object hash). This could either be supported in the blobstore (make it a multi-key/value store that can map multiple keys (namespaced by the application) to a single value) or at the application level (store an index).
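The blobstore variant of this idea (several namespaced keys pointing at one value) could look roughly like the Python sketch below. `MultiKeyStore` and its methods are hypothetical names invented for this illustration, and plain strings stand in for CIDs and git hashes.

```python
from typing import Dict, List, Optional, Tuple


class MultiKeyStore:
    """Toy blobstore: one primary key per value, plus namespaced aliases."""

    def __init__(self) -> None:
        # Primary key (e.g. a CID string) -> stored value.
        self._values: Dict[str, bytes] = {}
        # (namespace, foreign key) -> primary key.
        self._aliases: Dict[Tuple[str, str], str] = {}

    def put(self, cid: str, value: bytes) -> None:
        self._values[cid] = value

    def alias(self, namespace: str, key: str, cid: str) -> None:
        """Register an application-specific name for an existing value."""
        if cid not in self._values:
            raise KeyError(f"unknown primary key {cid!r}")
        self._aliases[(namespace, key)] = cid

    def get(self, cid: str) -> Optional[bytes]:
        return self._values.get(cid)

    def get_by_alias(self, namespace: str, key: str) -> Optional[bytes]:
        cid = self._aliases.get((namespace, key))
        return None if cid is None else self._values.get(cid)

    def aliases_for(self, cid: str) -> List[Tuple[str, str]]:
        """Reverse lookup: every outside name that maps to this value."""
        return sorted(k for k, v in self._aliases.items() if v == cid)
```

Keeping the index at the application level instead would amount to storing the `_aliases` mapping as an ordinary object owned by the application.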

@Stebalien
Contributor

Stebalien commented Oct 12, 2016

Another way to think about this (summary of the above): Given an IPLD object that points to a git object, what could you do with this information?

The only thing I can think of is to use some contextual information to find and download the git repository and then import it into your IPLD blob store. However, if you're going to do that, you might as well deterministically (based on the format declared in the CID) re-encode the git objects as IPLD+(CID format) objects. The only time I wouldn't do this is when there exists some way to ask some non-IPLD server for an object by-name.

@daviddias
Member

you might as well just import it all into your blobstore and convert it to IPLD+CBOR along the way (or some other IPLD format).

Not true for most cases: double hashing and transformations have a cost, and when data has a huge churn, this cost adds up very quickly. This is why it is important to be able to reference the data in its 'native' form.

@Stebalien
Contributor

when data has a huge churn, this cost adds up very quickly

  1. I assume you'd decode the objects and verify the hashes anyways when importing for validation.
  2. Hashing adds a constant overhead (per byte) that will be significantly less (out-of-date benchmarks: https://www.cryptopp.com/benchmarks.html) than the cost of downloading (unless you have a gigabit link). Encoding/decoding will add some more overhead but I doubt it will be too obscene.
  3. Downloading/re-encoding can be pipelined. Assuming that downloading is the bottleneck, you shouldn't notice a difference.
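The pipelining point can be sketched in Python with two threads and a bounded queue. This is only an illustration under stated assumptions: `download` is a stand-in that yields in-memory blocks rather than fetching from a network, and hex-encoding is a placeholder for a real re-encoding step.

```python
import hashlib
import queue
import threading
from typing import Iterable, List, Tuple

_DONE = object()  # sentinel marking the end of the stream


def download(blocks: Iterable[bytes], out: "queue.Queue") -> None:
    # Stand-in for the network stage: hands each block to the next stage.
    for block in blocks:
        out.put(block)
    out.put(_DONE)


def reencode(inp: "queue.Queue", results: List[Tuple[str, bytes]]) -> None:
    # Consumer stage: hash each block (validation) and re-encode it while
    # the producer is already working on the next one.
    while True:
        block = inp.get()
        if block is _DONE:
            return
        digest = hashlib.sha256(block).hexdigest()
        results.append((digest, block.hex().encode()))  # placeholder re-encoding


def run_pipeline(blocks: Iterable[bytes]) -> List[Tuple[str, bytes]]:
    q: "queue.Queue" = queue.Queue(maxsize=4)  # bounded, so stages stay in step
    results: List[Tuple[str, bytes]] = []
    producer = threading.Thread(target=download, args=(blocks, q))
    consumer = threading.Thread(target=reencode, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

If the download stage is the slower one, the consumer spends most of its time waiting on the queue, which is the situation the third point describes.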

@jbenet
Contributor

jbenet commented Oct 18, 2016

I have little time to respond in full here, sorry. the gist is this:

  • CID was invented precisely to allow us to ingest, store, address, and use
    {git, bitcoin, eth, ... } objects, in their native formats.
  • These datastructs exist already, and we will not get them to change their
    structure on our account.
  • We do not want to transform or wrap the objects in another format because
    that changes the hashes. the entire point is to be able to link using their
    exact hashes.
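A small Python sketch of the last point: wrapping the same bytes in another encoding yields a different hash, so links that use the native hash no longer match. SHA-256 and the JSON wrapper are arbitrary choices for this illustration, and the payload is a made-up stand-in for a git/bitcoin/eth object.

```python
import hashlib
import json


def native_hash(block: bytes) -> str:
    # Hash of the object exactly as the original system stores it.
    return hashlib.sha256(block).hexdigest()


def wrapped_hash(block: bytes) -> str:
    # Hash after wrapping the same bytes in another format
    # (a JSON envelope here, standing in for any re-encoding).
    envelope = json.dumps({"data": block.hex()}).encode()
    return hashlib.sha256(envelope).hexdigest()


# Made-up payload standing in for a native object.
block = b"stand-in for a native git/bitcoin/eth object"

# The two identifiers differ, so a link made with one cannot be
# checked against data addressed by the other.
hashes_match = native_hash(block) == wrapped_hash(block)
```

Addressing the object in its native format keeps `native_hash` as the identifier, which is what existing git or Ethereum references already use.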

@nicola @diasdavid and @jbenet discussed in person recently. this should no
longer be a confusion. @Stebalien, maybe @nicola can fill you in in person?

if you need more explanation, i can try again later.


@nicola
Member Author

nicola commented Jan 21, 2017

A logical follow up to this is: ipld/ipld#16

@Stebalien
Contributor

So, when I wrote up my opinion above I claimed "IMO, it only makes sense to do this when there is a way to resolve such names." While thinking about how to use IPFS as a git caching proxy, I realized this objection was pointless: the IPFS daemon/bitswap protocol will provide this service. For example, an IPLD aware git proxy would:

  1. Fetch https://example.com/my-git-repo.git/info/refs?service=git-upload-pack to get the hashes for the top-level git objects.
  2. Convert the hashes to CIDs.
  3. Attempt to fetch the git objects from the local IPFS daemon/resolver, falling back on git when necessary.
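Step 2 can be sketched in Python as below. This is a sketch under assumptions: it builds a CIDv1 using the `git-raw` codec (0x78) and SHA-1 multihash code (0x11) from the multicodec table as it exists today, which postdates parts of this thread, and it picks base32 multibase encoding; none of these choices is stated in the thread, and `git_hash_to_cid` is a hypothetical helper name.

```python
import base64

CID_V1 = 0x01         # CID version
CODEC_GIT_RAW = 0x78  # multicodec code for raw git objects (assumed, see lead-in)
MH_SHA1 = 0x11        # multihash code for SHA-1
SHA1_LEN = 20         # SHA-1 digest length in bytes


def git_hash_to_cid(hex_hash: str) -> str:
    """Wrap a git object hash (40 hex chars) as a base32 CIDv1 string."""
    digest = bytes.fromhex(hex_hash)
    if len(digest) != SHA1_LEN:
        raise ValueError("expected a 20-byte SHA-1 digest")
    # All four header values are below 0x80, so each fits in one varint byte.
    cid_bytes = bytes([CID_V1, CODEC_GIT_RAW, MH_SHA1, SHA1_LEN]) + digest
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32
```

The digest itself is carried over unchanged, so no object has to be re-hashed or re-encoded to produce the CID.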

So, while I do believe we should encourage new applications to use existing formats, making git and friends first-class formats makes sense. Sorry for the confusion.

@daviddias daviddias added the status/deferred Conscious decision to pause or backlog label Mar 19, 2018
@rvagg
Member

rvagg commented Aug 14, 2019

Closing due to staleness as per team agreement to clean up the issue tracker a bit (ipld/team-mgmt#28). This doesn't mean this issue is off the table entirely, it's just not on the current active stack but may be revisited in the near future. If you feel there is something pertinent here, please speak up, reopen, or open a new issue. [/boilerplate]

@rvagg rvagg closed this as completed Aug 14, 2019