matrix-org · ara4n · Oct 22, 2019 · Oct 28, 2019 · Oct 28, 2019 · Oct 29, 2019
diff --git a/proposals/2326-label-based-filtering.md b/proposals/2326-label-based-filtering.md
@@ -0,0 +1,210 @@
+# Label based filtering
+
+## Problem
+
+Rooms often contain overlapping conversations, which Matrix should help users
+navigate.
+
+## Context
+
+We already have the concept of 'Replies' to define which messages are responses
+to which, which [MSC1849](https://github.com/matrix-org/matrix-doc/pull/1849)
+proposes extending into a generic mechanism for defining threads which could (in
+future) be paginated both depth-wise and breadth-wise.  Meanwhile,
+[MSC1198](https://github.com/matrix-org/matrix-doc/issues/1198) is an alternate
+proposal for threading, which separates conversations into high-level "swim
+lanes" with a new `POST /rooms/{roomId}/thread` API.
+
+However, fully generic threading (which could be used to implement forum or
+email style semantics) runs a risk of being overly complicated to specify and
+implement and could result in feature creep. This is doubly true if you try to
+implement retrospective threading (e.g. to allow moderators to split off
+messages into their own thread, as you might do in a forum or to help manage
+conversation in a busy chatroom).
+
+Therefore, this is a simpler proposal to allow messages in a room to be filtered
+based on a given label in order to give basic one-layer-deep threading
+functionality.
+
+## Proposal
+
+We let users specify an optional `m.labels` field onto the events. This field
+lists freeform text labels:
+
+```json
+{
+  // ...
+  "m.labels": [ "somelabel" ]
+}
+```
+
+The labels are expected to be insensitive to case, therefore clients are
+expected to lowercase them before sending them to servers. A label's length is
+limited to a maximum of 100 characters.
+
+Labels which are prefixed with # are expected to be user-visible and exposed to
+the user by clients as a hashtag, letting the user filter their current room by
+the various hashtags present within it. Labels which are not prefixed with # are
+expected to be hidden from the user by clients (so that they can be used as
+e.g. thread IDs bridged from another platform).
+
+Clients can use these to filter the overlapping conversations in a room into
+different topics. The labels could also be used when bridging as a hashtag to
+help manage the disconnect which can happen when bridging a threaded room to an
+unthreaded one.
+
+Clients are expected to explicitly set the label on a message if the user's
+intention is to respond as part of a given labelled topic.  For instance, if the
+user is currently filtered to only view messages with a given label, then new
+messages sent should use the same label. Similarly if the user sends a reply to
+a given message, that reply should typically use the same labels as the message
+being replied to.
+
+When a user wants to filter a room to given label(s), it defines a filter for
+use with `/sync`, `/context`, `/search` or `/messages` to limit appropriately.
+This is done by new `labels` and `not_labels` fields to the `EventFilter`
+object, which specifies a list of labels to include or exclude in the given
+filter.
+
+Senders may edit the `m.label` fields in order to change the labels associated
+with an event. If an edit removes a label that was previously associated with
+the original event or a past edit of it, neither the original event nor an edit
+of it should be returned by the server when filtering for events with that
+label.
+
+When editing the list of labels associated with an encrypted event, clients must
+set the updated list of labels in the `content` field of the encrypted event in
+addition with the `m.new_content` field of the decrypted event's `content`
+field, so that servers can update the list of labels associated with the
+original event accordingly.
+
+When sending an edit of an event that has labels attached to it, clients are
+expected to provide a list of labels, even if the edit doesn't add or remove any
+label from the list provided in the original event or its latest edit (in this
+case, the list is the same as the one provided in the original event or its
+latest edit).
+
+### Encrypted rooms
+
+In encrypted rooms, the `m.label` field of `m.room.encrypted` events contains,
+for each label of the event that's being encrypted, a SHA256 hash of the
+concatenation of the text label and the ID of the room the event is being sent
+to, i.e. `hash = SHA256(label_text + room_id)`.
+
+The reason behind using a hash built from the text label and the ID of the room
+here instead of e.g. a random opaque string or a peppered hash is to maintain
+consistency of the key without having access to the entire history of the room
+or exposing the actual text of the label to the server, so that e.g. a new
+client joining the room would be able to use the same key for the same label as
+any other client. See the ["Alternative solutions"](#alternative-solutions) for
+more information on this point.
+
+When filtering events based on their label(s), clients are expected to use the
+hash of the label(s) to filter in or out instead of the actual label text.
+
+#### Example
+
+Consider a label `#fun` on a message sent to a room which ID is
+`!someroom:example.com`. Before encryption, the message would be:
+
+```json
+{
+  "type": "m.room.message",
+  "content": {
+    "body": "who wants to go down the pub?",
+    "msgtype": "m.text",
+    "m.labels": [ "#fun" ]
+  }
+}
+```
+
+`3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a` is the SHA256
+hash of the string `#fun!someroom:example.com`. Here's an example code
+(JavaScript) to compute it:
+
+```javascript
+label_key_unhashed = "#fun" + "!someroom:example.com"
+hash = crypto.createHash('sha256');
+hash.write(label_key_unhashed);
+label_key = hash.digest("hex"); // 3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a
+```
+
+Once encrypted, the event would become:
+
+```json
+{
+  "type": "m.room.encrypted",
+  "content": {
+    "algorithm": "m.megolm.v1.aes-sha2",
+    "ciphertext": "AwgAEpABm6.......",
+    "device_id": "SOLZHNGTZT",
+    "sender_key": "FRlkQA1enABuOH4xipzJJ/oD8fxiQHj6jrAyyrvzSTY",
+    "session_id": "JPWczbhnAivenK3qRwqLLBQu4W13fz1lqQpXDlpZzCg",
+    "m.labels": [
+      "3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a"
+    ]
+  }
+}
+```
+
+## Problems
+
+Do we care about internationalising hashtags?
+
+Too many threading APIs?
+
+Using hashes means that servers could be inclined to compute rainbow tables to
+read labels on encrypted messages. However, since we're using the room ID as
+some kind of hash, it makes it much more expensive to do because it would mean
+maintaining one rainbow table for each encrypted room it's in, which would
+probably make it not worth the trouble.
+
+## Alternative solutions
+
+Instead of using hashes to identify labels in encrypted messages, using random
+opaque strings was also considered. Bearing in mind that we need to be able to
+use the label identifiers to filter the history of the room server-side (because
+we're not expecting clients to know about the whole history of the room), this
+solution had the following downsides, all originating from the fact that nothing
+would prevent 1000 clients from using each a different identifier:
+
+* filtering would have serious performances issues in E2EE rooms, as the server
+  would need to return all events it knows about which label identifier is any
+  of the 1000 identifiers provided by the client, which is quite expensive to
+  do.
+
+* it would be impossible for a filtered `/message` (or `/sync`) request to
+  include every event matching the desired label because we can't expect a
+  client to know about every identifier that has been used in the whole history
+  of the room, or about the fact that another client might suddenly decide to
+  use another identifier for the same label text, and include those identifiers
+  in its filtered request.
+
+Another proposed solution would be to use peppered hashes, and to store the
+pepper in the encrypted event. However, this solution would have the same
+downsides as described above.
+
+## Security considerations
+
+The proposed solution for encrypted rooms, despite being the only one we could
+think of when writing this proposal that would make filtering possible while
+obscuring the labels to some level, isn't ideal as it still allows servers to
+figure out labels by computing [rainbow
+tables](https://en.wikipedia.org/wiki/Rainbow_table).
+
+Because of this, clients might want to limit the use of this feature in
+encrypted rooms, for example by enabling it with an opt-in option in the
+settings, or showing a warning message to the users.
+
+It is likely that this solution will be replaced as part of a future proposal
+once a more fitting solution is found.
+
+## Unstable prefix
+
+Unstable implementations should hook up `org.matrix.labels` rather than
+`m.labels`. When defining filters, they should also use `org.matrix.labels` and
+`org.matrix.not_labels` in the `EventFilter` object.
+
+Additionally, servers implementing this feature should advertise that they do so
+by exposing a `org.matrix.label_based_filtering` flag in the `unstable_features`
+part of the `/versions` response.