Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Require body for read receipts with user-agent exceptions #11157

Merged
merged 9 commits into from
Nov 9, 2021
5 changes: 5 additions & 0 deletions changelog.d/11156.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Limit which clients are allowed to send read receipts without a body (matrix-org #11156)

synapse\rest\client\receipts.py

Added in a regex check.
rogersheu marked this conversation as resolved.
Show resolved Hide resolved
9 changes: 8 additions & 1 deletion synapse/rest/client/receipts.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,12 @@
# limitations under the License.

import logging
import re
from typing import TYPE_CHECKING, Tuple

from synapse.api.constants import ReadReceiptEventFields
from synapse.api.errors import Codes, SynapseError
from synapse.http import get_request_user_agent
from synapse.http.server import HttpServer
from synapse.http.servlet import RestServlet, parse_json_object_from_request
from synapse.http.site import SynapseRequest
Expand Down Expand Up @@ -52,7 +54,12 @@ async def on_POST(
if receipt_type != "m.read":
raise SynapseError(400, "Receipt type must be 'm.read'")

body = parse_json_object_from_request(request, allow_empty_body=True)
allow_empty_body = False
user_agent = get_request_user_agent(request)
if re.search(".*Riot.*", user_agent) or re.search("Element/1.[012].*", user_agent) or re.search("SchildiChat/1.[012].*", user_agent):
allow_empty_body = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11156 forgot to mention, but this should be restricted to user agents that contain Android (could use "Android" in user_agent).

I would also be tempted to collapse Element and SchildiChat together:

re.search("(Element|SchildiChat)/1.[012].*", user_agent)

Note that re.search will match anywhere in the string (this is different to re.match, which matches only at the beginning).

re.search(".*Riot.*", user_agent) can then be re.search("Riot", user_agent) instead.
But at that point, you may as well just write "Riot" in user_agent since we don't need any regex powers to do that :).

Also note that in re.search("(Element|SchildiChat)/1.[012].*", user_agent), . matches any character, so this would also match Element/110 which may not be what we were after.

Out of interest, this would be what a combined pattern looks like:
r"(Riot|(Element|SchildiChat)/1\.[012]\.).*Android"
(This also checks that it's an Android client: here's an example string from my device Element/1.2.2 (Linux; U; Android 9; ...; MatrixAndroidSDK_X 0.0.1))

(if you're not aware of this trick: I use r" so that I don't have to escape backslashes by writing \\ instead of \).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh! Good catch on the Android part. So maybe we should wrap that whole thing and do something like:

user_agent = get_request_user_agent(request)
pattern = re.compile(r"(?:Element|SchildiChat)/1\.[012]\.")

allow_empty_body = False
if "Android" in user_agent:
    if pattern.match(user_agent) or "Riot" in user_agent:
        allow_empty_body = True

Does that look right? Should we make the pattern a module-level constant to prevent excess calls to compile?

This will exclude a tiny number of weird builds, like Element dbg/1.1.8-dev or SchildiChat[f]/1.2.2.sc44, but they're sufficiently uncommon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems right to me. Perhaps we ought to have a few test cases to ensure we don't mess this up big time.

I can't remember when I read this, but iirc you're better off using re.compile once or not at all: re.match and friends have an internal cache for compiled patterns, which re.compile bypasses (or did at the time).
My personal preference is indeed to make it a module-level constant and guarantee that it's compiled only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchildiChat[f]/1.2.2.sc44, but they're sufficiently uncommon.

This one makes me a little suspicious: does the [f] perhaps stand for F-Droid?

Maybe we ought to be nice and allow some wiggle room:

pattern = re.compile(r"(?:Element|SchildiChat)[^/]*/1\.[012]\.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked again, and all the [f] and [g] SchildiChats in the logs are version 1.2.2 or later, so they can be excluded from this pattern. There was one SchildiChat.Beta[f]/1.2.0.sc42-test1 which would presumably break, but again, not worth the complexity for that one weirdo with an old test build :)

Copy link
Contributor Author

@rogersheu rogersheu Oct 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was wondering about the periods and whether those (and other various parts of the string) had to be exact too.

What's the rationale behind using re.search vs. re.compile? It looks like re.search searches for substrings by default while re.compile can match exactly as written, right?

Perhaps the weird builds could just be specifically white-listed if push comes to shove and if there aren't too many of them.

(if you're not aware of this trick: I use r" so that I don't have to escape backslashes by writing \ instead of ).

Thanks for pointing that out!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind using re.search vs. re.compile? It looks like re.search searches for substrings by default while re.compile can match exactly as written, right?

I think you might mean re.match where you say re.compile.

To use a regular expression like a.*b, the re module has to first build a "finite state machine". That machine does all the work (working out if there's a match; where it is; what the capture groups captured, ....). Creating the machine has a cost to it. After it's built, the machine can be reset and used again, without paying that build cost.

Python offers re.compile so you can build the state machine once and re-use it, and that's what @callahad proposes above.

Copy link
Contributor Author

@rogersheu rogersheu Oct 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DMRobertson ! Makes a lot of sense. I do see now that re.search was the incorrect choice for regex function. I had worked out the difference between re.search and re.match but you explained the difference between re.search/re.match and re.compile, which is what I was looking for.

To summarize (to the extent of my knowledge), in re.compile, the re module builds the finite state machine and doesn't have to be built again (but its state can be changed when called), while each call of re.match includes the building of a finite state machine each time (like re.compile, then pattern.match), taking up runtime/memory.

I can't remember when I read this, but iirc you're better off using re.compile once or not at all: re.match and friends have an internal cache for compiled patterns, which re.compile bypasses (or did at the time).
My personal preference is indeed to make it a module-level constant and guarantee that it's compiled only once.

"at the time?" Like during recent testing? (Apologies in advance if I'm asking too many questions about this.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @reivilibre meant "at the time I read the article".

FWIW there's https://docs.python.org/3/howto/regex.html#module-level-functions which writes:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.

https://stackoverflow.com/a/61603344/5252017 has more to say. Apparently the cache size is currently 512 patterns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's helpful!


body = parse_json_object_from_request(request, allow_empty_body)
hidden = body.get(ReadReceiptEventFields.MSC2285_HIDDEN, False)

if not isinstance(hidden, bool):
Expand Down