Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2326: Label based filtering #2326

Open
wants to merge 22 commits into
base: old_master
Choose a base branch
from
Open
Changes from 19 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
210 changes: 210 additions & 0 deletions proposals/2326-label-based-filtering.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
# Label based filtering
Copy link
Contributor

@babolivier babolivier Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bubu

This is really handy for finding stuff, especially the images and links part. As this indexing needs to be done server side and also in encrypted rooms we'd need a way to attach labels to messages, which state, this message contains an image/at least one link/an attachment/etc.

The labels for this would need to be interpreted by the server, so for encrypted rooms we'd need to use a set of well-known labels.
This will obviously leak metadata, so this would probably need to be an optional feature for encrypted rooms.

Agreed that this would be a handy feature, and that listing all labels in a room is a nice thing to have generally. I'm thinking of one possible way to do this, which would be to add an endpoint that exposes the list of labels the server knows have been used in the room (which should be fairly easy given the server will probably already store (event_id, label) tuples in its database for efficiency). For encrypted rooms, this would return a list of hashes (which is what the server considers as a list of labels for that room, since it doesn't know about the actual labels), and clients would then be able to resolve those hashes by calling /messages with a filter containing the labels to resolve, and extracting the labels from the response (which contains events that the client should be able to decrypt). This would allow such a feature to work well without having to leak more metadata.

wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!
As for metadata concerns, these are still there, as we are working on a set of ~6 fixed labels for this usecase. But this is basically already covered in the "Security Considerations" section here.

Whether or not the actually usage of these tags for images/links/etc. will become optional in E2EE chats is not part of this MSC I believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we resolve this thread?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that once I've updated the MSC to describe this solution, which I haven't got time to do yet.


## Problem

Rooms often contain overlapping conversations, which Matrix should help users
navigate.

## Context

We already have the concept of 'Replies' to define which messages are responses
to which, which [MSC1849](https://github.com/matrix-org/matrix-doc/pull/1849)
proposes extending into a generic mechanism for defining threads which could (in
future) be paginated both depth-wise and breadth-wise. Meanwhile,
[MSC1198](https://github.com/matrix-org/matrix-doc/issues/1198) is an alternate
proposal for threading, which separates conversations into high-level "swim
lanes" with a new `POST /rooms/{roomId}/thread` API.

However, fully generic threading (which could be used to implement forum or
email style semantics) runs a risk of being overly complicated to specify and
implement and could result in feature creep. This is doubly true if you try to
implement retrospective threading (e.g. to allow moderators to split off
messages into their own thread, as you might do in a forum or to help manage
conversation in a busy chatroom).

Therefore, this is a simpler proposal to allow messages in a room to be filtered
based on a given label in order to give basic one-layer-deep threading
functionality.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, the 'reply to an existing message' is an important use case.

A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that like replying to the message based on message ID?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads.

AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up.

Copy link
Contributor

@ptman ptman May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer hierarchical threads, but either way, regardless of the how the relationships are recorded, they can be shown flat, just like apple mail and gmail do.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity.


## Proposal

We let users specify an optional `m.labels` field onto the events. This field
babolivier marked this conversation as resolved.
Show resolved Hide resolved
lists freeform text labels:

```json
{
// ...
"m.labels": [ "somelabel" ]
}
```

The labels are expected to be insensitive to case, therefore clients are
expected to lowercase them before sending them to servers. A label's length is
limited to a maximum of 100 characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforced how? What error code do I get for sending such an event? what if it's received over federation with >100 characters?


Labels which are prefixed with # are expected to be user-visible and exposed to
the user by clients as a hashtag, letting the user filter their current room by
the various hashtags present within it. Labels which are not prefixed with # are
expected to be hidden from the user by clients (so that they can be used as
e.g. thread IDs bridged from another platform).
babolivier marked this conversation as resolved.
Show resolved Hide resolved

Clients can use these to filter the overlapping conversations in a room into
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concern overall with this proposal is that once someone uses #invites, no one can ever use that label again to start a heavily related but different thread. This is where swim lanes are nice because it means you can have overlapping subject matter without conflicting the room's perspective. Relationships (1849) also solve this problem.

different topics. The labels could also be used when bridging as a hashtag to
help manage the disconnect which can happen when bridging a threaded room to an
unthreaded one.

Clients are expected to explicitly set the label on a message if the user's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the" implies there's only one you have to worry about - shouldn't you be copying the whole set?

intention is to respond as part of a given labelled topic. For instance, if the
user is currently filtered to only view messages with a given label, then new
messages sent should use the same label. Similarly if the user sends a reply to
a given message, that reply should typically use the same labels as the message
being replied to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My remaining doubt is how we're expecting clients to expose adding a label initially, eg. if there would be another button in the composer or similar to add a label to a new message or whether you'd just let them be added retrospectively. Likewise, would we expect to show the labels on each message / show on hover etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some clarifications on that point in 4b7ca52, but designing UI isn't my area of expertise so I'm happy to discuss it.


When a user wants to filter a room to given label(s), it defines a filter for
use with `/sync`, `/context`, `/search` or `/messages` to limit appropriately.
This is done by new `labels` and `not_labels` fields to the `EventFilter`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this doesn't seem to make it possible to expose to the user which labels are present in a room (since the client would have to scan all of the events in a room for labels). IMO, it would be poor UX to not have available labels visible. I suppose there could be two ways of doing this:

  1. Have the server scan each event for labels and keep track of them
  2. Add a state event (ex, m.labels) where the state key is set to a label and the content could contain attributes like description or label color (if desired for UX). IMO this would be the cleanest option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a proposal about an endpoint to get the list of labels available in a given room here: #2326 (comment), but haven't gotten around to add it to the proposal yet.

object, which specifies a list of labels to include or exclude in the given
filter.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be something for another MSC down the road, but would it make sense to have power level restrictions for labels? For example, I may want to have an announcements label in my room that only bots and I can post in. For example, a GitHub status label could be useful to filter out announcement spam from the main room conversation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting thought, but is probably out of the scope of this MSC imho

Senders may edit the `m.label` fields in order to change the labels associated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.labels is supposed to be plural though? This proposal seems confused as to what it wants to offer: a single label or many. What are the benefits of many labels anyways?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be premature to discuss editing events before MSC1849 is ready; but let's at least link to it to clarify what we mean by "edit"

with an event. If an edit removes a label that was previously associated with
the original event or a past edit of it, neither the original event nor an edit
of it should be returned by the server when filtering for events with that
label.

When editing the list of labels associated with an encrypted event, clients must
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph should be under "Encrypted rooms"

set the updated list of labels in the `content` field of the encrypted event in
addition with the `m.new_content` field of the decrypted event's `content`
field, so that servers can update the list of labels associated with the
original event accordingly.

When sending an edit of an event that has labels attached to it, clients are
expected to provide a list of labels, even if the edit doesn't add or remove any
label from the list provided in the original event or its latest edit (in this
case, the list is the same as the one provided in the original event or its
latest edit).

### Encrypted rooms

In encrypted rooms, the `m.label` field of `m.room.encrypted` events contains,
for each label of the event that's being encrypted, a SHA256 hash of the
concatenation of the text label and the ID of the room the event is being sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the client isn't required to lowercase the label before sending, it's just expected that they do. To make this a bit safer to use it would be best if the client received an error for not lowercasing their labels.

to, i.e. `hash = SHA256(label_text + room_id)`.
babolivier marked this conversation as resolved.
Show resolved Hide resolved

The reason behind using a hash built from the text label and the ID of the room
here instead of e.g. a random opaque string or a peppered hash is to maintain
consistency of the key without having access to the entire history of the room
or exposing the actual text of the label to the server, so that e.g. a new
client joining the room would be able to use the same key for the same label as
any other client. See the ["Alternative solutions"](#alternative-solutions) for
more information on this point.

babolivier marked this conversation as resolved.
Show resolved Hide resolved
When filtering events based on their label(s), clients are expected to use the
hash of the label(s) to filter in or out instead of the actual label text.

#### Example

Consider a label `#fun` on a message sent to a room which ID is
`!someroom:example.com`. Before encryption, the message would be:

```json
{
"type": "m.room.message",
"content": {
"body": "who wants to go down the pub?",
"msgtype": "m.text",
"m.labels": [ "#fun" ]
}
}
```

`3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a` is the SHA256
hash of the string `#fun!someroom:example.com`. Here's an example code
(JavaScript) to compute it:

```javascript
label_key_unhashed = "#fun" + "!someroom:example.com"
hash = crypto.createHash('sha256');
hash.write(label_key_unhashed);
label_key = hash.digest("hex"); // 3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a
```

Once encrypted, the event would become:

```json
{
"type": "m.room.encrypted",
"content": {
"algorithm": "m.megolm.v1.aes-sha2",
"ciphertext": "AwgAEpABm6.......",
"device_id": "SOLZHNGTZT",
"sender_key": "FRlkQA1enABuOH4xipzJJ/oD8fxiQHj6jrAyyrvzSTY",
"session_id": "JPWczbhnAivenK3qRwqLLBQu4W13fz1lqQpXDlpZzCg",
"m.labels": [
"3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a"
]
}
}
```

## Problems

Do we care about internationalising hashtags?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the room will be using a specific language, so probably not.


Too many threading APIs?

Using hashes means that servers could be inclined to compute rainbow tables to
read labels on encrypted messages. However, since we're using the room ID as
some kind of hash, it makes it much more expensive to do because it would mean
maintaining one rainbow table for each encrypted room it's in, which would
probably make it not worth the trouble.

## Alternative solutions
Copy link
Member Author

@ara4n ara4n Sep 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels give a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled.

Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them).

This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings.


Instead of using hashes to identify labels in encrypted messages, using random
opaque strings was also considered. Bearing in mind that we need to be able to
use the label identifiers to filter the history of the room server-side (because
we're not expecting clients to know about the whole history of the room), this
solution had the following downsides, all originating from the fact that nothing
would prevent 1000 clients from using each a different identifier:

* filtering would have serious performances issues in E2EE rooms, as the server
would need to return all events it knows about which label identifier is any
of the 1000 identifiers provided by the client, which is quite expensive to
do.

* it would be impossible for a filtered `/message` (or `/sync`) request to
include every event matching the desired label because we can't expect a
client to know about every identifier that has been used in the whole history
of the room, or about the fact that another client might suddenly decide to
use another identifier for the same label text, and include those identifiers
in its filtered request.

Another proposed solution would be to use peppered hashes, and to store the
pepper in the encrypted event. However, this solution would have the same
downsides as described above.

## Security considerations

The proposed solution for encrypted rooms, despite being the only one we could
think of when writing this proposal that would make filtering possible while
obscuring the labels to some level, isn't ideal as it still allows servers to
figure out labels by computing [rainbow
tables](https://en.wikipedia.org/wiki/Rainbow_table).

Because of this, clients might want to limit the use of this feature in
encrypted rooms, for example by enabling it with an opt-in option in the
settings, or showing a warning message to the users.

It is likely that this solution will be replaced as part of a future proposal
once a more fitting solution is found.

## Unstable prefix

Unstable implementations should hook up `org.matrix.labels` rather than
`m.labels`. When defining filters, they should also use `org.matrix.labels` and
`org.matrix.not_labels` in the `EventFilter` object.

Additionally, servers implementing this feature should advertise that they do so
by exposing a `org.matrix.label_based_filtering` flag in the `unstable_features`
part of the `/versions` response.