Agentic Federation on Iroh: Graph Data & Linked Documents Layers #32

zicklag · 2024-06-01T20:52:10Z

zicklag
Jun 1, 2024
Maintainer

These were some quick thoughts I posted in the Commune Discord. I'm cross-posting here for link/search/persist-ability.

So, along the lines of a next-gen federation protocol on top of Iroh, after comparing Weird's needs, my own thoughts for a note-taking/garden/zettelkasten tool, some ideas I was talking to @mp about, and everything that we've just thrown around here, I think I'm coming to the idea of two layers on top of Iroh that we want, that could be shared between all these ideas.

These layers could be made into libraries/specifications that I think would be good building blocks for other projects.

This is all a super rough draft idea, so it needs more thought, but I wanted to share.

The first layer is a graph layer.
The second layer, on top of the first, is a linked document layer.

Graph Layer

The graph layer would be a very simple standard for storing graph data on top of the existing key-value store in Iroh.

Something roughly like:

Each value in the key-value store must be in the graph format
Each value is one of the following: string, blob, list, object, bool, null, or link.
Each kind of value is identified with a byte prefix
Strings, blobs, bools, and nulls are stored inline in the value
Links reference another key, to link to another value
Lists have a reference to a key prefix, and every key that starts with that prefix is considered in the list, and the order is based on the byte-sorted keys.
Objects have a reference to a key prefix, and every key that starts with that prefix is considered in the object, the portion of the key after the prefix represents the key in the object.
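
As a rough illustration of the format described above, here is a minimal Rust sketch of a value type with a one-byte prefix per kind. The tag numbers, the `tagged` helper and the choice to store the target key or key prefix as raw bytes after the tag are all assumptions made for the example, not part of any agreed format.

```rust
/// One value in the key-value store, following the rough draft above.
/// The tag bytes used below are arbitrary placeholders.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    String(String),
    Blob(Vec<u8>),
    /// Points at another key.
    Link(Vec<u8>),
    /// Every key starting with this prefix is an item, ordered by key bytes.
    List(Vec<u8>),
    /// Every key starting with this prefix is a field; the rest of the key is the field name.
    Object(Vec<u8>),
}

/// Build `tag` followed by `payload`.
fn tagged(tag: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + payload.len());
    out.push(tag);
    out.extend_from_slice(payload);
    out
}

impl Value {
    /// Encode the value as a kind prefix byte plus its inline payload.
    pub fn encode(&self) -> Vec<u8> {
        match self {
            Value::Null => tagged(0, &[]),
            Value::Bool(false) => tagged(1, &[]),
            Value::Bool(true) => tagged(2, &[]),
            Value::String(s) => tagged(3, s.as_bytes()),
            Value::Blob(b) => tagged(4, b),
            Value::Link(key) => tagged(5, key),
            Value::List(prefix) => tagged(6, prefix),
            Value::Object(prefix) => tagged(7, prefix),
        }
    }

    /// Decode a stored value; returns None for empty input, unknown tags or invalid UTF-8.
    pub fn decode(bytes: &[u8]) -> Option<Value> {
        let (tag, rest) = bytes.split_first()?;
        Some(match *tag {
            0 => Value::Null,
            1 => Value::Bool(false),
            2 => Value::Bool(true),
            3 => Value::String(String::from_utf8(rest.to_vec()).ok()?),
            4 => Value::Blob(rest.to_vec()),
            5 => Value::Link(rest.to_vec()),
            6 => Value::List(rest.to_vec()),
            7 => Value::Object(rest.to_vec()),
            _ => return None,
        })
    }
}
```

Since every entry carries its own kind byte, a generic reader could tell a link from a string without knowing anything else about the data.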

Linked Documents

The linked document layer would be a standard built on top of the graph store.

It's built around storing documents.

Each document would have:

Metadata: metadata would be an object stored in the graph format.
Links: links are references to other documents.
Body: the raw body of the document. This is just a bunch of bytes. The exact format of the body should be described in the metadata. So it could be a chat, a file, a note, a blog post, markdown, HTML or whatever.

The idea is that the linked document layer can be used to build a kind of "internet".

But it's built not on HTML, but on all of these different kinds of documents we create, whether they be chats or forum topics or whatever.

They should all be able to be linked to each other, having different kinds of relationships with each other.

zicklag
Jun 2, 2024
Maintainer Author

Had another thought: I think document bodies with rich text should use "rich text facets" like Bluesky does, by default, though obviously you can totally make other formats, and just specify different metadata so the client knows what to do with it.

https://www.pfrazee.com/blog/why-facets

I think the docs mentioned that facets couldn't overlap each other, which seems wrong to me, though. Like what if you want bold italic bold? I haven't looked deep into it, but the important part is being able to specify the formatting/annotations etc. separately from the content, and allow clients to independently make decisions on markup, WYSIWYG, etc., and even make their own custom annotations.

The other reason I think facets are a good idea is that they let us easily extract all of the links to other documents that are in the body of the document.

For example, if we have an embed facet, then you can instantly parse out all of the embeds in a post, without having to parse out markdown or something. The data is just there.
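
To make that concrete, here is a small Rust sketch of a facet-annotated body. The type and field names (`RichText`, `Facet`, `Feature`) are made up for illustration; the only idea borrowed from Bluesky is that annotations live next to the text as byte ranges instead of inside it.

```rust
/// What a facet says about the text it covers. The variants are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum Feature {
    Bold,
    Italic,
    /// A link to another document.
    Link { namespace: String, path: String },
    /// Another document embedded in this one.
    Embed { namespace: String, path: String },
}

/// An annotation over a byte range of the text.
#[derive(Debug, Clone, PartialEq)]
pub struct Facet {
    pub start: usize, // byte offset, inclusive
    pub end: usize,   // byte offset, exclusive
    pub feature: Feature,
}

pub struct RichText {
    pub text: String,
    pub facets: Vec<Facet>,
}

impl RichText {
    /// Every (namespace, path) the body links to or embeds, found without parsing the text.
    pub fn linked_documents(&self) -> Vec<(&str, &str)> {
        self.facets
            .iter()
            .filter_map(|f| match &f.feature {
                Feature::Link { namespace, path } | Feature::Embed { namespace, path } => {
                    Some((namespace.as_str(), path.as_str()))
                }
                _ => None,
            })
            .collect()
    }

    /// The text a facet covers, or None if its range is out of bounds or splits a character.
    pub fn covered<'a>(&'a self, facet: &Facet) -> Option<&'a str> {
        self.text.get(facet.start..facet.end)
    }
}
```

Nothing in this sketch stops two facets from overlapping, so bold and italic ranges can nest freely; whether the real format should allow that is still the open question raised above.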

0 replies

zicklag · 2024-06-02T13:56:59Z

zicklag
Jun 2, 2024
Maintainer Author

Regarding the document format, it might be simplest to just represent each document as an Object in the graph, that has a body.

I think we might not need the document's links as a separate concept: links, by being built on the graph protocol, are just automatically supported in Objects.

So for instance, a comment could just be:

{
    "comment_on": ["namespaceid", "pathtopost"], // this is a link
    "body": "Cool stuff!",
}

I really like how simple this is. We should probably standardize on some other common metadata, though, like headers in HTTP. We'll at the very least want a content_type, so that you know whether the doc is JSON/HTML/Markdown/FacetAnnotatedDoc or something like that.
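
Here is a sketch of how an object like that could flatten into individual key-value entries, with a `content_type` field added next to the body. The `/` separator, the `Field` type and the `flatten` helper are assumptions for the example only.

```rust
use std::collections::BTreeMap;

/// A field of a document object. Only the two kinds needed for the example.
#[derive(Debug, Clone, PartialEq)]
pub enum Field {
    Text(String),
    /// (namespace id, path) of another document.
    Link(String, String),
}

/// Flatten a document's fields into key-value entries stored under `doc_key`.
/// Keys are joined with '/' purely for illustration.
pub fn flatten(doc_key: &str, fields: &[(&str, Field)]) -> BTreeMap<String, Field> {
    fields
        .iter()
        .map(|(name, value)| (format!("{doc_key}/{name}"), value.clone()))
        .collect()
}
```

A reader that only cares about the body can fetch the single `.../body` key, and a generic tool can find `comment_on` simply because its value is a link.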

0 replies

zicklag · 2024-06-05T02:40:37Z

zicklag
Jun 5, 2024
Maintainer Author

Just got the first working example of an Iroh Graph API. In the example below the state.graph is an IrohGStore.

https://github.com/commune-os/weird/blob/562a452b484fe307e4ce3cd0aa07a690e2885de2/backend/src/routes/profile.rs#L17-L82

I've iterated on the idea a little bit and removed the concept of lists. Being built on a key-value store, lists are not worth implementing properly, and you might as well just use a map, and iterate over the keys, and use ULID/UUID for the keys if you don't "care" about a key being semantically meaningful.

1 reply

zicklag Jun 7, 2024
Maintainer Author

Another point about storing lists as maps in this model:

Sets could be represented by putting the value in the key. For example a unique list of tags could be represented by making a map where each value in the map is null, and each key is the tag name.
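
A quick sketch of the tags example, using a `BTreeMap` as a stand-in for the byte-sorted key-value store. The `Store` alias, the `keys_under` helper and the `/`-separated key strings are illustrative assumptions, not the actual Iroh Graph API.

```rust
use std::collections::BTreeMap;

/// Stand-in for the store: keys sorted by bytes, `None` playing the role of a null value.
pub type Store = BTreeMap<String, Option<String>>;

/// All keys under `prefix`, with the prefix stripped, in byte-sorted key order.
pub fn keys_under<'a>(store: &'a Store, prefix: &str) -> Vec<&'a str> {
    store
        .range(prefix.to_string()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(key, _)| &key[prefix.len()..])
        .collect()
}
```

Writing the same tag twice just overwrites the same key, so uniqueness comes for free, and iteration order is simply key order.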

zicklag · 2024-06-29T21:29:00Z

zicklag
Jun 29, 2024
Maintainer Author

Right now the one thing about our data model that is bugging me slightly is the Map type. There's nothing in the underlying store that prevents a String value from also having other values in it like a map, that can be retrieved with value.get_key("somekey"). Also, the Map type is really just a marker that says, I expect to have sub-keys, but it doesn't mean anything else, so it seems semi-useless. It could just as well not be there, or be Value::Null. It's more for documentation than anything else.

This leads me to an idea to replace the Map type with a Schema type. This would contain metadata describing the expected types of data that is stored under that prefix.

This could allow you to automatically validate and load into Rust structs that could do type-checked FromValue implementations of some sort. I think this could improve the lack of static typing.

Needs a little more thought, but I think it makes sense.

4 replies

zicklag Jul 11, 2024
Maintainer Author

As an iteration of this concept, I realized that for each "object" in the store, its effective schema should be a list of "interfaces" that it implements, similar to "components" in an ECS, or "traits" in Rust.

I recently ran into the fact that we have a Profile object in Weird, that stores all the user profile info, but the Weird web instance wants to additionally add a domain field to the Profile, which was not originally meant to be a part of the Profile object. In this scenario we essentially have a 3rd party wanting to "add" or "implement" an extra interface for an object.

This challenges the concept of having a static Schema on the object, and storing the schema in the top level of the object.

This makes it seem that we have two kinds of "objects": "Entities", and "Components".

An Entity is like what we currently define as a Value::Map, where each field in the map is expected to be a Component.

A Component is like what we currently define as a Value::Map, where every field is expected to conform to a specific schema, and is not intended to be extended by apps in a way that violates its schema.

In this way, entities are the foundation of any kind of "object" we might think of from a user perspective. For example, chat messages, blog posts, comments, tweets, pages, projects, teams, profiles, users, etc. would all be entities. Because they are entities, they are extensible.

Extensible entities allow different clients that understand different components, to render or explore the entity through whichever components it understands. For example, you may have a Name component that provides a display name for the entity, and you may have a Description component that is meant to contain a short description of an entity. You might also have an Image component that contains something like an avatar or a feature image for the entity. These components are applicable for a wide variety of objects, from blog posts, to user profiles.

By breaking objects down into standard components, a CLI interface that only knows how to deal with the Name and Description can read those components without worrying about the Image component, and a link preview on a microblogging platform can display the Name, Description, and Image of an object, without knowing or caring whether it is a blog post, or a user profile page, or something else entirely.

Different applications may create their own components to represent extra data that may be tied to entities, without needing other applications to understand the component.

Components should hypothetically be small and focused most of the time, so that entities can be made out of a combination of meaningful components, that each app can support incrementally.

@erlend-sh had also brought up the Block Protocol, which is a kind of similar component ideology from frontend UI components. It would be beneficial to be able to integrate with the Block Protocol for rendering some standard components, so that web frontends can optionally re-use actual web components associated with different data components.

From a data perspective then, we replace Value::Map with two new variants, Value::Entity and Value::Component(Schema).
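
A tiny sketch of what "reading only the components you understand" could look like. The string component ids (`"Name"`, `"Description"`), the UTF-8 payloads and the `Preview` type are simplifying assumptions for the example.

```rust
use std::collections::BTreeMap;

/// An entity is just a bag of components, keyed by some component identifier.
/// Plain string ids stand in for whatever ends up identifying a component schema.
pub struct Entity {
    pub components: BTreeMap<String, Vec<u8>>,
}

/// What a link-preview style client might extract from any entity.
#[derive(Debug, PartialEq)]
pub struct Preview {
    pub name: Option<String>,
    pub description: Option<String>,
}

impl Entity {
    /// Read a component as UTF-8 text, if it exists and is valid.
    fn text(&self, id: &str) -> Option<String> {
        let bytes = self.components.get(id)?;
        String::from_utf8(bytes.clone()).ok()
    }

    /// Read only the Name and Description components and ignore everything else.
    pub fn preview(&self) -> Preview {
        Preview {
            name: self.text("Name"),
            description: self.text("Description"),
        }
    }
}
```

The preview code never needs to know whether the entity is a profile or a blog post, and a third-party component (such as the extra domain field mentioned above) can sit next to the others without breaking it.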

aschrijver Jul 12, 2024

I have Block Protocol on a list of techz that I track and find very interesting. But there are things to be wary of. Previously Github explored adoption and chose not to adopt. And HASH recently chose to no longer pursue planned environments Github and Figma. The old site still had:

And HASH used to be a seemingly much more interesting company before they pivoted to their current bland Yet-Another-AI proposition.

zicklag Jul 12, 2024
Maintainer Author

But there are things to be wary of.

Yeah, I think it's something we can harmlessly and optionally experiment with, if it doesn't like suck up a bunch of our time.

It'd be nice if it worked, but people, ourselves included can always just choose to make custom web components for different data components.

erlend-sh Jul 12, 2024
Maintainer

Yeah, we will only use Block insofar as it can give us certain desirables for free.

The protocol itself hasn’t caught on, so the interop prospects alone are not valuable to us.

aschrijver · 2024-07-12T04:20:36Z

aschrijver
Jul 12, 2024

Also I'd like to get name ideas for these layers.

I wonder if it's better to refer to Resources instead of Documents in a way that is similar to how we refer to Web resources. Then a Resource has Metadata and Content, which is defined by its Content Type. And it can be represented in a Document. I feel Resource fits the abstraction better and aligns well to existing uses of this terminology.

4 replies

aschrijver Jul 12, 2024

PS. Reading in Iroh docs now the terminology of "documents" may refer to Iroh Documents? Above I did not consider them such, but as extra layers on top of Iroh related to 'agentic federation' / weird. I.e. new stuff to be modeled and named :)

zicklag Jul 12, 2024
Maintainer Author

Yeah, I had some trouble even writing out the overview above without stepping on terms. :)

Right now, key-value stores in Iroh are called "Documents", but once the next iteration is complete they will be more appropriately called "Namespaces".

As far as the concept of Resources with Metadata and Content, I was initially thinking along these lines, but now I think I'm liking the Entity and Component model more, since it means that an Object/Entity/Resource/Document/whatever-we-call-the-core-"thing" is not restricted to having only one content type.

I like that each Entity is essentially the sum of its multiple "content-types" ( components ).

erlend-sh Jul 12, 2024
Maintainer

Sounds good. Might be useful to have some surface-level familiarity with WordPress’ ‘post types’ as well:

https://wordpress.org/documentation/article/what-is-post-type/
https://developer.wordpress.org/themes/basics/post-types/

zicklag Jul 12, 2024
Maintainer Author

Yeah, that's good stuff to look at for sure. We're definitely going to want to grow an ecosystem of standard components, and browsing things like WordPress post types, ActivityStreams, and other similar apps and standards will be a good exercise.

The goal will be to break all of these different kinds of content into little chunks that might be shared between them. We also should think about how we might represent things like edits and version histories.

zicklag · 2024-07-12T17:14:11Z

zicklag
Jul 12, 2024
Maintainer Author

Extending on #32 (reply in thread), I had some important thoughts on how to represent Components. Right now we have a separate key-value entry for each primitive value in the namespace.

For example, if we have some data that has a name and a description, then each of those would be in their own keys. This breaks the data into really small pieces, which is possibly non-ideal. If it's something that would always be read/written together at the same time, then it'd be good to store them under one key.

Another consideration, good or bad, is that everything under a single key has to be updated at the same time. For example, if you have Name and Description under the same key, then if you update the name on one computer, and the description on another computer, they won't be merged natively. Whichever change was made most recently will apply, taking both the name and the description together.

This actually might be a good thing in many cases, because otherwise it's impossible to make sure two pieces of data are definitely updated at the same time.

Anyway, initially I kept all pieces of data separate because it was the simplest way to start out, and I wanted to allow multiple services to be able to add their own data to objects.

With the Entity Component model, though, I think components give us a good way to break things up into appropriately sized pieces that will be updated together.

So, instead I think we should store each component's data serialized together under the same key. That means the structure of an entity in the key-value store will look like:

Key:                                                  Value:
["path", "to", "entity"]                              Entity
["path", "to", "entity", "HashOfComponentSchemaXXX"]  Component(`binary serialized component data`)
["path", "to", "entity", "OtherComponentSchemaHash"]  Component(`binary serialized component data`)

Notice that the key to the component, under the entity, is actually the hash of its schema. So by reading the key of the component, we can retrieve the schema, and then deserialize the binary component data according to the schema.
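
Here is a rough Rust sketch of that layout. The `Store` type, its methods and the string schemas are invented for the example, and `DefaultHasher` is only a stand-in: a real implementation would need a stable content hash, which `DefaultHasher` is not.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a real content hash of a serialized schema (not stable across Rust versions).
pub fn schema_hash(schema: &str) -> String {
    let mut hasher = DefaultHasher::new();
    schema.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

pub type Key = Vec<String>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Entity,
    Component(Vec<u8>),
}

pub struct Store {
    pub entries: BTreeMap<Key, Value>,
    /// Schemas by hash, so a reader can go from a component key to its schema.
    pub schemas: BTreeMap<String, String>,
}

impl Store {
    pub fn new() -> Self {
        Store { entries: BTreeMap::new(), schemas: BTreeMap::new() }
    }

    /// Store one component under `entity`, keyed by the hash of its schema.
    pub fn put_component(&mut self, entity: &[&str], schema: &str, data: Vec<u8>) {
        let entity_key: Key = entity.iter().map(|s| s.to_string()).collect();
        let hash = schema_hash(schema);
        self.schemas.insert(hash.clone(), schema.to_string());
        let mut component_key = entity_key.clone();
        component_key.push(hash);
        self.entries.insert(entity_key, Value::Entity);
        self.entries.insert(component_key, Value::Component(data));
    }

    /// Return (schema, data) for every component stored directly under `entity`.
    pub fn components(&self, entity: &[&str]) -> Vec<(&str, &[u8])> {
        self.entries
            .iter()
            .filter_map(|(key, value)| {
                let (last, parent) = key.split_last()?;
                let same_entity = parent.len() == entity.len()
                    && parent.iter().zip(entity.iter()).all(|(a, b)| a.as_str() == *b);
                if !same_entity {
                    return None;
                }
                match value {
                    Value::Component(data) => {
                        let schema = self.schemas.get(last)?;
                        Some((schema.as_str(), data.as_slice()))
                    }
                    Value::Entity => None,
                }
            })
            .collect()
    }
}
```

The reader never has to be told which schema a component uses: the last key segment is enough to look it up.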

I think this is quite simple and elegant.

If we are serializing multiple values into the component data now, we have to think about the serialization format.

Because the values of components stored in Iroh are actually hashed and stored in the blob store, it would be great to take advantage of automatic de-duplication for components that have the same value.

To get the best de-duplication it's best to have the serializer always produce a canonical set of bytes for any given object. That means that if an object has the same data, it always has the same bytes.

That made me think about Borsh for serializing and deserializing. It's very fast, simple, and designed to always use a canonical byte representation. It also already has an ( unstable but working ) schema system in the Rust implementation: borsh::schema::Definition.

Borsh also has libraries for many languages already, which is good: Rust, TypeScript, JavaScript, Java, Kotlin, Scala, Clojure, Go, Python, AssemblyScript, C#, C++, C++20, and Elixir.

The Borsh schema can also be serialized using Borsh, which is perfect for our use-case of storing the schema itself in the Iroh store.

Finally, it's so simple that if we needed to make our own implementation or fork the specification for whatever reason, then we could do so.
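
To show why a canonical format helps with de-duplication, here is a hand-written encoder that follows Borsh's layout for two field types: a string is a little-endian `u32` length followed by its UTF-8 bytes, and an option is a `0` or `1` byte followed by the value when present. It deliberately does not use the `borsh` crate, and the `Profile` type is hypothetical.

```rust
/// A hypothetical component with two fields.
pub struct Profile {
    pub name: String,
    pub description: Option<String>,
}

/// Write a string as a little-endian u32 length followed by its UTF-8 bytes.
fn write_string(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

impl Profile {
    /// Serialize fields in declaration order, with no padding or optional whitespace,
    /// so equal values always produce identical bytes.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        write_string(&mut out, &self.name);
        match &self.description {
            None => out.push(0),
            Some(description) => {
                out.push(1);
                write_string(&mut out, description);
            }
        }
        out
    }
}
```

Two peers that independently write the same `Profile` end up with byte-identical values, so a content-addressed blob store only has to keep one copy.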

I'm really liking the way this feels. I want to give it some more time to settle, but I think it accomplishes our goals very well.

5 replies

zicklag Jul 12, 2024
Maintainer Author

It's worth noting that Borsh doesn't have a native concept of links like we do with our current Value type. I think this might be fine.

For example, Borsh still lets us store all the link data we need to just fine, it's just not inherently and obviously a "link".

Consider, though, how this is not much different than the ambiguity that you will run into with many other kinds of data.

For example, a schema may have a body field that is a String type. That's fine, but that doesn't tell us whether it's HTML, Markdown, BBCode or anything else. We must still have external context to understand that this String is actually some sort of a markup format. We also face the same problem with pretty much any kind of binary data. Maybe there is a data field, and maybe there is a content-type field next to it, by which we can determine the type of bytes in data, but we still depend on external documentation and understanding of the schema to properly comprehend the data.

The question is whether or not the concept of a "link" to another entry in the graph store is so fundamental that it should be represented by its own data type, so that, for example, generic data explorers can follow those links around to other entities, even if the explorer doesn't understand the rest of the component data itself.

I'm currently on the fence. It seems like, in order to call it a "graph store", we very much need a native concept of "links" with which we can build the graph.

My hesitation is that we must fork Borsh at that point.

Ah wait! I just had another idea, though. The only real distinction between some bytes and a "link" is actually in the schema, not in the Borsh data format itself!

What this means is that we don't have to fork Borsh, we just have to make our own standard for the currently unstable and unstandardized Borsh schema. This is perfect for our use-case, especially because it's not something that has already been stabilized anyway.

I think that's the answer I was looking for.
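
A minimal sketch of the idea, using a made-up `Schema` enum and not the real `borsh::schema` types: the link is just another schema node, so a generic explorer can find link fields by walking the schema alone.

```rust
/// A hypothetical schema description. `Link` would use the same bytes on the wire
/// as ordinary data, but the schema marks it as pointing at another entry.
#[derive(Debug, Clone, PartialEq)]
pub enum Schema {
    Text,
    Bytes,
    Link,
    Struct(Vec<(String, Schema)>),
}

/// Walk a schema and return the dotted paths of every field that is a link.
pub fn link_fields(schema: &Schema) -> Vec<String> {
    fn walk(schema: &Schema, path: &str, out: &mut Vec<String>) {
        match schema {
            Schema::Link => out.push(path.to_string()),
            Schema::Struct(fields) => {
                for (name, field) in fields {
                    let child = if path.is_empty() {
                        name.clone()
                    } else {
                        format!("{path}.{name}")
                    };
                    walk(field, &child, out);
                }
            }
            Schema::Text | Schema::Bytes => {}
        }
    }
    let mut out = Vec::new();
    walk(schema, "", &mut out);
    out
}
```

With something like this, an explorer that has never seen a particular component could still list where its links are and follow them.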

mikebryant Jul 14, 2024

So, instead I think we should store each component's data serialized together under the same key. That means the structure of an entity in the key-value store will look like:
Key:                                                  Value:
["path", "to", "entity"]                              Entity
["path", "to", "entity", "HashOfComponentSchemaXXX"]  Component(`binary serialized component data`)
["path", "to", "entity", "OtherComponentSchemaHash"]  Component(`binary serialized component data`)

What happens when someone's reading an entity and it's updated mid-way through? Or your sync is interrupted mid-way through a session and you only have partial data for an entity?

Is there any way of making an entity update atomic?

zicklag Jul 14, 2024
Maintainer Author

I think in Iroh the only atomic updates are at the individual key level. So each component would be updated atomically, and you would never get a half-updated data for a single component, but you couldn't atomically update multiple components at the same time.

I'm not sure if that's something that could come in the future, or not.

I think that might actually be fine, because we don't want to force you to update an entire entity and all its components at the same time all the time, but we also probably don't want to break the entity up completely so that we can't atomically update any fields.

mikebryant Jul 14, 2024

🤔 Could the Entity itself have the references to the components - so you update/upload components first, then update the Entity with all the references to the current state?

I'm wondering what the user experience will be like if you end up with Components like a facet for embeds or highlighting, that ends up out of sync with the payload component. I think people can expect eventual consistency on the level of "I can't see all of the posts of this user yet", but they're going to be much more confused by "The post is all garbled and weird because I only got half the updates for it"

In particular, is it a problem if say, a malicious relay node or peer only takes part of your updates to a single entity.

zicklag Jul 14, 2024
Maintainer Author

That's a good point. Maybe we could have a concept of a component aggregate. Basically, a way to put multiple components into a single key, to allow you to intentionally force an atomic update of multiple components, without changing the way that applications read/understand the data in the components.

That would actually be really easy to add. For example, because the hashes of each component are a known length, we could structure an atomic update of two components like this:

Key:                                                                          Value:
["path", "to", "entity"]                                                      Entity
["path", "to", "entity", "HashOfComponentSchemaXXXOtherComponentSchemaHash"]  Component(`component1binarydata``component2binarydata`)

That keeps the attribute of being able to read the key and know schema immediately. You just split the key into chunks the length of the hash, and each individual hash would be an individual schema, and you can read the components out one by one.
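
The key-splitting step is small enough to sketch directly. `HASH_LEN` of 32 bytes is an assumption here (a 256-bit hash), and `split_schema_hashes` is a hypothetical helper name.

```rust
/// Length in bytes of one schema hash. 32 is an assumption (a 256-bit hash).
pub const HASH_LEN: usize = 32;

/// Split the last segment of an aggregate key into its individual schema hashes.
/// Returns None if the segment is empty or not a whole number of hashes.
pub fn split_schema_hashes(segment: &[u8]) -> Option<Vec<[u8; HASH_LEN]>> {
    if segment.is_empty() || segment.len() % HASH_LEN != 0 {
        return None;
    }
    Some(
        segment
            .chunks_exact(HASH_LEN)
            .map(|chunk| {
                let mut hash = [0u8; HASH_LEN];
                hash.copy_from_slice(chunk);
                hash
            })
            .collect(),
    )
}
```

Splitting the value is a separate problem: the reader has to deserialize the first component with its schema to learn where the second one starts.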

That does reduce your de-duplication by merging the components into one value, though. 🤔

So, it wouldn't be efficient to do that with an Image component and a Body component. But you might want to, to make sure that the image of your tweet always matches the content.

Thinking about this again, maybe we could use Iroh collections for components.

A collection is an immutable list of blobs.
The value of each entity is set to a CollectionId.
Each blob in the entity's collection would store a component.
Each component is prefixed with the hash of its schema, followed by the component's binary data.

That would make all updates to entities completely atomic.
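
A sketch of the blob layout and of building the next version of an entity. A plain `Vec<Vec<u8>>` stands in for an Iroh collection here, and `HASH_LEN`, `split_component` and `with_component` are hypothetical names; none of this touches the real Iroh API.

```rust
/// Length in bytes of one schema hash. 32 is an assumption (a 256-bit hash).
pub const HASH_LEN: usize = 32;

/// Split one blob from an entity's collection into (schema hash, component bytes).
pub fn split_component(blob: &[u8]) -> Option<(&[u8], &[u8])> {
    if blob.len() < HASH_LEN {
        return None;
    }
    Some(blob.split_at(HASH_LEN))
}

/// Build the next version of an entity's blob list: replace (or add) the component
/// with the given schema hash and carry every other blob over unchanged.
pub fn with_component(
    old: &[Vec<u8>],
    schema_hash: &[u8; HASH_LEN],
    data: &[u8],
) -> Vec<Vec<u8>> {
    let mut blob = schema_hash.to_vec();
    blob.extend_from_slice(data);
    let mut next: Vec<Vec<u8>> = old
        .iter()
        .filter(|existing| !existing.starts_with(schema_hash))
        .cloned()
        .collect();
    next.push(blob);
    next
}
```

Because the old list is never mutated, a reader sees either the previous collection or the new one, and unchanged blobs keep the same bytes so they can still de-duplicate.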

The only disadvantage is that you have to manually implement any kind of merging of updates to the same entity, but I feel like you'll have to do that most of the time anyway, using something like Automerge or other CRDTs, so I don't think it's something to really consider a disadvantage.

By-and-large, you will want all updates to entities to be completely atomic.

I think this is actually a really good iteration on the idea. Thanks for forcing me to think harder about it @mikebryant!

zicklag · 2024-07-14T18:09:15Z

zicklag
Jul 14, 2024
Maintainer Author

Starting a thread on Encryption:

Continuing from this idea: #32 (reply in thread), copied from what I posted in chat.

I think there's a reasonable way to handle encryption suitable for most use-cases without having to encrypt keys.

I think usually you want to use key paths for grouping and discovering entities, not for describing the entities, which should be done with components. So, for instance, the filename wouldn't be in the key, it would be in a component, and the key should be something like a ULID, or if you want to avoid any semantic relationship to time, a UUID. So then keys are always used as a way to group things under prefixes for the sake of public visibility or permissions, for example in combination with Meadowcap.

Having taken care of that, all we have to worry about is encrypting the components. @mikebryant I like your idea of supporting different encryptions, so I'm thinking that we have each component prefixed with an encryption ID. For unencrypted content, that could literally be prefixing the component with a null byte. If it's prefixed with 0x01 instead, we could read something like a ULID after it that represents the ID of some encryption standard. That's kind of similar in concept to the "multihash" in IPFS, or even DIDs, except it'd be a "multiencrypt", and we could use ULIDs to avoid having to register encryption IDs with a centralized authority.

Anyway, that allows you to optionally encrypt each component individually. And since the schema hashes are encoded with the component data, they are encrypted too, so you avoid exposing any metadata except for the key of the entity.

Another cool thing about this is that it allows you to put encrypted components on an entity and unencrypted components, on a component-by-component basis. This is great for storing things like private contact information on Weird profiles.

We can start by just prefixing all our components with a null byte to indicate that it's unencrypted, and then flesh out the rest of our encryption standard later, without having to do anything about it now.
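
A sketch of what reading that prefix could look like. Only the `0x00` and `0x01` prefix bytes come from the idea above; the `Envelope` type, the function names and treating the scheme ID as a raw 16-byte ULID are assumptions for illustration.

```rust
/// How a stored component is wrapped.
#[derive(Debug, PartialEq)]
pub enum Envelope<'a> {
    /// Prefix 0x00: the rest is the component as-is.
    Plain(&'a [u8]),
    /// Prefix 0x01: a 16-byte (128-bit, ULID-sized) scheme ID, then the ciphertext.
    Encrypted { scheme: [u8; 16], ciphertext: &'a [u8] },
}

/// Parse the encryption prefix of a stored component.
/// Returns None for empty input, an unknown prefix or a truncated scheme ID.
pub fn parse_envelope(bytes: &[u8]) -> Option<Envelope<'_>> {
    let (prefix, rest) = bytes.split_first()?;
    match *prefix {
        0x00 => Some(Envelope::Plain(rest)),
        0x01 => {
            if rest.len() < 16 {
                return None;
            }
            let (id, ciphertext) = rest.split_at(16);
            let mut scheme = [0u8; 16];
            scheme.copy_from_slice(id);
            Some(Envelope::Encrypted { scheme, ciphertext })
        }
        _ => None,
    }
}

/// Wrap an unencrypted component by prefixing it with a null byte.
pub fn wrap_plain(component: &[u8]) -> Vec<u8> {
    let mut out = vec![0x00];
    out.extend_from_slice(component);
    out
}
```

Starting with `wrap_plain` everywhere costs one byte per component and leaves room to add encrypted envelopes later without rewriting existing data.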

0 replies
