Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2326: Label based filtering #2326

Open
wants to merge 22 commits into
base: old_master
Choose a base branch
from
Open
Changes from 5 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
187 changes: 187 additions & 0 deletions proposals/2326-label-based-filtering.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,187 @@
# Label based filtering
Copy link
Contributor

@babolivier babolivier Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bubu

This is really handy for finding stuff, especially the images and links part. As this indexing needs to be done server side and also in encrypted rooms we'd need a way to attach labels to messages, which state, this message contains an image/at least one link/an attachment/etc.

The labels for this would need to be interpreted by the server, so for encrypted rooms we'd need to use a set of well-known labels.
This will obviously leak metadata, so this would probably need to be an optional feature for encrypted rooms.

Agreed that this would be a handy feature, and that listing all labels in a room is a nice thing to have generally. I'm thinking of one possible way to do this, which would be to add an endpoint that exposes the list of labels the server knows have been used in the room (which should be fairly easy given the server will probably already store (event_id, label) tuples in its database for efficiency). For encrypted rooms, this would return a list of hashes (which is what the server considers as a list of labels for that room, since it doesn't know about the actual labels), and clients would then be able to resolve those hashes by calling /messages with a filter containing the labels to resolve, and extracting the labels from the response (which contains events that the client should be able to decrypt). This would allow such a feature to work well without having to leak more metadata.

wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!
As for metadata concerns, these are still there, as we are working on a set of ~6 fixed labels for this usecase. But this is basically already covered in the "Security Considerations" section here.

Whether or not the actually usage of these tags for images/links/etc. will become optional in E2EE chats is not part of this MSC I believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we resolve this thread?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that once I've updated the MSC to describe this solution, which I haven't got time to do yet.


## Problem

Rooms often contain overlapping conversations, which Matrix should help users
navigate.

## Context

We already have the concept of 'Replies' to define which messages are responses
to which, which [MSC1849](https://github.com/matrix-org/matrix-doc/pull/1849)
proposes extending into a generic mechanism for defining threads which could (in
future) be paginated both depth-wise and breadth-wise. Meanwhile,
[MSC1198](https://github.com/matrix-org/matrix-doc/issues/1198) is an alternate
proposal for threading, which separates conversations into high-level "swim
lanes" with a new `POST /rooms/{roomId}/thread` API.

However, fully generic threading (which could be used to implement forum or
email style semantics) runs a risk of being overly complicated to specify and
implement and could result in feature creep. This is doubly true if you try to
implement retrospective threading (e.g. to allow moderators to split off
messages into their own thread, as you might do in a forum or to help manage
conversation in a busy chatroom).

Therefore, this is a simpler proposal to allow messages in a room to be filtered
based on a given label in order to give basic one-layer-deep threading
functionality.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, the 'reply to an existing message' is an important use case.

A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that like replying to the message based on message ID?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads.

AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up.

Copy link
Contributor

@ptman ptman May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer hierarchical threads, but either way, regardless of the how the relationships are recorded, they can be shown flat, just like apple mail and gmail do.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity.


## Proposal

We let users specify an optional `m.labels` field onto the events. This field
babolivier marked this conversation as resolved.
Show resolved Hide resolved
maps key strings to freeform text labels:

```json
{
// ...
"m.labels": {
"somekey": "somelabel"
}
}
```

Labels which are prefixed with # are expected to be user-visible and exposed to
the user by clients as a hashtag, letting the user filter their current room by
the various hashtags present within it. Labels which are not prefixed with # are
expected to be hidden from the user by clients (so that they can be used as
e.g. thread IDs bridged from another platform).
babolivier marked this conversation as resolved.
Show resolved Hide resolved

Clients can use these to filter the overlapping conversations in a room into
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concern overall with this proposal is that once someone uses #invites, no one can ever use that label again to start a heavily related but different thread. This is where swim lanes are nice because it means you can have overlapping subject matter without conflicting the room's perspective. Relationships (1849) also solve this problem.

different topics. The labels could also be used when bridging as a hashtag to
help manage the disconnect which can happen when bridging a threaded room to an
unthreaded one.

Clients are expected to explicitly set the label on a message if the user's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the" implies there's only one you have to worry about - shouldn't you be copying the whole set?

intention is to respond as part of a given labelled topic. For instance, if the
user is currently filtered to only view messages with a given label, then new
messages sent should use the same label. Similarly if the user sends a reply to
a given message, that reply should typically use the same labels as the message
being replied to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My remaining doubt is how we're expecting clients to expose adding a label initially, eg. if there would be another button in the composer or similar to add a label to a new message or whether you'd just let them be added retrospectively. Likewise, would we expect to show the labels on each message / show on hover etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some clarifications on that point in 4b7ca52, but designing UI isn't my area of expertise so I'm happy to discuss it.


When a user wants to filter a room to given label(s), it defines a filter for
use with /sync or /messages to limit appropriately. This is done by new `labels`
and `not_labels` fields to the `EventFilter` object, which specifies a list of
labels to include or exclude in the given filter.

### Encrypted rooms

In encrypted events, the string used as the key in the map is a SHA256 hash of a
contatenation of the text label and the ID of the room the event is being sent
babolivier marked this conversation as resolved.
Show resolved Hide resolved
to. Once encrypted by the client, the resulting `m.room.encrypted` event's
content contains a `m.labels_hashes` property which is an array of these hashes.
babolivier marked this conversation as resolved.
Show resolved Hide resolved

babolivier marked this conversation as resolved.
Show resolved Hide resolved
When filtering events based on their label(s), clients are expected to use the
hash of the label(s) to filter in or out instead of the actual label text.

#### Example

Consider a label `#fun` on a message sent to a room which ID is
`!someroom:example.com`. Before encryption, the message would be:

```json
{
"type": "m.room.message",
"content": {
"body": "who wants to go down the pub?",
"msgtype": "m.text",
"m.labels": {
"3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a": "#fun"
}
}
}
```

`3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a` is the SHA256
hash of the string `#fun!someroom:example.com`.

Once encrypted, the event would become:

```json
{
"type": "m.room.encrypted",
"content": {
"algorithm": "m.megolm.v1.aes-sha2",
"ciphertext": "AwgAEpABm6.......",
"device_id": "SOLZHNGTZT",
"sender_key": "FRlkQA1enABuOH4xipzJJ/oD8fxiQHj6jrAyyrvzSTY",
"session_id": "JPWczbhnAivenK3qRwqLLBQu4W13fz1lqQpXDlpZzCg",
"m.labels_hashes": [
"3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a"
]
}
}
```

### Unencrypted rooms

In unencrypted rooms, the string to use as a key does not matter (as this format
is only kept for consistency with events sent in encrypted rooms) and clients
are free to use any non-empty string they wish (as long as it's unique per label
in the event).

When filtering events based on their label(s), clients are expected to use the
actual label text instead of the string key.

#### Example

```json
{
"type": "m.room.message",
"content": {
"body": "who wants to go down the pub?",
"msgtype": "m.text",
"m.labels": {
"somekey": "#fun"
}
}
}
```

## Problems

Do we care about internationalising hashtags?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the room will be using a specific language, so probably not.


Too many threading APIs?

Using hashes means that servers could be inclined to compute rainbow tables to
read labels on encrypted messages. However, since we're using the room ID as
some kind of hash, it makes it much more expensive to do because it would mean
maintaining one rainbow table for each encrypted room it's in, which would
probably make it not worth the trouble.

## Alternative solutions
Copy link
Member Author

@ara4n ara4n Sep 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels give a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled.

Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them).

This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings.


Instead of using hashes to identify labels in encrypted messages, using random
opaque strings was also considered. Bearing in mind that we need to be able to
use the label identifiers to filter the history of the room server-side (because
we're not expecting clients to know about the whole history of the room, see my
first point above), this solution had the following downsides, all originating
babolivier marked this conversation as resolved.
Show resolved Hide resolved
from the fact that nothing would prevent 1000 clients from using each a
different identifier:

* filtering would have serious performances issues in E2EE rooms, as the server
would need to return all events it knows about which label identifier is any
of the 1000 identifiers provided by the client, which is quite expensive to
do.

* it would be impossible for a filtered `/message` (or `/sync`) request to
include every event matching the desired label because we can't expect a
client to know about every identifier that has been used in the whole history
of the room, or about the fact that another client might suddenly decide to
use another identifier for the same label text, and include those identifiers
in its filtered request.

Another proposed solution would be to use peppered hashes, and to store the
pepper in the encrypted event. However, this solution would have the same
downsides as described above.

## Unstable prefix

Unstable implementations should hook up `org.matrix.labels` rather than
`m.labels`, and `org.matrix.labels_hashes` rather than `m.labels_hashes`. When
defining filters, they should also use `org.matrix.labels` and
`org.matrix.not_labels` in the `EventFilter` object.

Additionally, servers implementing this feature should advertise that they do so
by exposing a `label_based_filtering` flag in the `unstable_features` part of
babolivier marked this conversation as resolved.
Show resolved Hide resolved
the `/versions` response.