Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federation sender worker not federating messages when using multiple stream writers #14928

Open
tonkku107 opened this issue Jan 27, 2023 · 28 comments
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@tonkku107
Copy link

tonkku107 commented Jan 27, 2023

Description

My federation sender worker hasn't been federating outgoing messages since updating to 1.76.0rc1. Even messages sent before doing the first faster join haven't gone through. Presence and typing notifications do seem to go through. Incoming messages are being received normally.

This only happens when multiple event stream writers are configured.

Steps to reproduce

  • set up at least one federation sender
  • add multiple generic workers as event stream writers
  • try to send messages

Homeserver

tonkku.me

Synapse Version

1.76.0rc1

Installation Method

Docker (matrixdotorg/synapse)

Database

single postgresql database

Workers

Multiple workers

Platform

Running in docker containers on a server with Ubuntu 20.04.5, AMD Ryzen 5 3600, 64GB RAM.

Configuration

Using a worker setup with one main synapse instance, 4 generic workers and 1 federation sender. Presence is enabled.
Experimental features enabled:

msc3026_enabled: true
msc2815_enabled: true
msc3773_enabled: true
msc3882_enabled: true
msc3886_endpoint: 'https://rendezvous.lab.element.dev'
msc3912_enabled: true

worker configuration:

send_federation: false
federation_sender_instances:
  - federation_sender1
instance_map:
  generic_worker1:
    host: generic-worker1
    port: 9111
  generic_worker2:
    host: generic-worker2
    port: 9112
  generic_worker3:
    host: generic-worker3
    port: 9113
  generic_worker4:
    host: generic-worker4
    port: 9114
stream_writers:
  events:
    - generic_worker1
    - generic_worker2
    - generic_worker3
    - generic_worker4
redis:
  enabled: true
  host: redis
worker_app: synapse.app.federation_sender
worker_name: federation_sender1

worker_replication_host: synapse
worker_replication_http_port: 9100

worker_log_config: "/data/tonkku.me.log.config"

Relevant log output

tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,215 - synapse.replication.http.send_events - 147 - INFO - POST-31712 - Got batch of events to send, last ID of batch is: $1Kq2JWRsQNYq3BdNb_D1vUPeJciH7j39-l7dFyDqhWE, sending into room: !KibRmKHfMtoXfmIqBP:tonkku.me
tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,232 - synapse.access.http.9114 - 460 - INFO - POST-31712 - ::ffff:192.168.240.6 - 9114 - {None} Processed request: 0.017sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 149B 200 "POST /_synapse/replication/send_events/frDXoxQLuc HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,233 - synapse.http.client - 460 - INFO - PUT-10818 - Received response to POST http://generic-worker4:9114/_synapse/replication/send_events/frDXoxQLuc: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,233 - synapse.replication.tcp.client - 372 - INFO - PUT-10818 - Waiting for repl stream 'events' to reach 817572 (generic_worker4)
tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,237 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-725 - Streaming: events -> 817572
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,239 - synapse.replication.tcp.client - 384 - INFO - PUT-10818 - Finished waiting for repl stream 'events' to reach 817572 (generic_worker4)
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,240 - synapse.access.http.9102 - 460 - INFO - PUT-10818 - [redacted IP] - 9102 - {@tonkku:tonkku.me} Processed request: 0.049sec/0.000sec (0.005sec, 0.001sec) (0.003sec/0.013sec/15) 59B 200 "PUT /_matrix/client/r0/rooms/!KibRmKHfMtoXfmIqBP%3Atonkku.me/send/m.room.encrypted/m1674805828912.7 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [3 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,259 - synapse.access.http.9101 - 460 - INFO - GET-12340 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 4.449sec/0.000sec (0.007sec, 0.002sec) (0.024sec/0.045sec/18) 1272B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817571%7E2.817568_20566155_2601_2123712_33123_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [1 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,452 - synapse.access.http.9100 - 460 - INFO - POST-64485 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,453 - synapse.http.client - 460 - INFO - GET-12342 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server  | 2023-01-27 07:50:29,463 - synapse.access.http.8008 - 460 - INFO - POST-64484 - [redacted IP] - 8008 - {@tonkku:tonkku.me} Processed request: 0.012sec/0.001sec (0.000sec, 0.001sec) (0.001sec/0.008sec/4) 2B 200 "POST /_matrix/client/r0/rooms/!KibRmKHfMtoXfmIqBP%3Atonkku.me/read_markers HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,464 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109167 - Streaming: account_data -> 33124
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,471 - synapse.access.http.9101 - 460 - INFO - GET-12342 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 0.021sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 729B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123712_33123_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,590 - synapse.access.http.9100 - 460 - INFO - POST-64486 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,591 - synapse.http.client - 460 - INFO - GET-12344 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,892 - synapse.federation.transport.server.federation - 103 - INFO - PUT-12345 - Received txn 1673813262285 from matrix.hadallen.ca. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:29,896 - synapse.replication.http.federation - 186 - INFO - POST-64487 - Got 'm.presence' edu from matrix.hadallen.ca
tonkku-matrix-server  | 2023-01-27 07:50:29,897 - synapse.access.http.9100 - 460 - INFO - POST-64487 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/jtnpiXDrUe HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,897 - synapse.http.client - 460 - INFO - PUT-12345 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/jtnpiXDrUe: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,907 - synapse.access.http.9101 - 460 - INFO - PUT-12345 - 24.68.83.130 - 9101 - {matrix.hadallen.ca} Processed request: 0.015sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.009sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1673813262285 HTTP/1.0" "Synapse/1.73.0" [0 dbevts]
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:30,172 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-393797 - {PUT-O-181577} [bancino.net] Request failed: PUT matrix://bancino.net/_matrix/federation/v1/send/1674746060146: ResponseNeverReceived:[CancelledError()]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:30,760 - synapse.federation.transport.server.federation - 103 - INFO - PUT-14201 - Received txn 1674755342297 from matrix.org. (PDUs: 0, EDUs: 4)
tonkku-matrix-server  | 2023-01-27 07:50:30,764 - synapse.replication.http.federation - 186 - INFO - POST-64488 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,765 - synapse.replication.http.federation - 186 - INFO - POST-64489 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,766 - synapse.replication.http.federation - 186 - INFO - POST-64490 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,767 - synapse.replication.http.federation - 186 - INFO - POST-64491 - Got 'm.receipt' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,769 - synapse.handlers.device - 1111 - INFO - POST-64488 - Received device list update for @hannahbarbarian:matrix.org, requiring resync: True. Devices: HMJZNAGOZS
tonkku-matrix-server  | 2023-01-27 07:50:30,770 - synapse.handlers.device - 1111 - INFO - POST-64489 - Received device list update for @lr7:matrix.org, requiring resync: True. Devices: WVPPLBOOFW
tonkku-matrix-server  | 2023-01-27 07:50:30,783 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109168 - Streaming: receipts -> 2123713
tonkku-matrix-server  | 2023-01-27 07:50:30,788 - synapse.access.http.9100 - 460 - INFO - POST-64491 - ::ffff:192.168.240.5 - 9100 - {None} Processed request: 0.021sec/0.000sec (0.001sec, 0.001sec) (0.002sec/0.015sec/5) 279B 200 "POST /_synapse/replication/fed_send_edu/m.receipt/TtOesxZXle HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:30,789 - synapse.http.client - 460 - INFO - PUT-14201 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.receipt/TtOesxZXle: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:30,794 - synapse.access.http.9101 - 460 - INFO - GET-12344 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 1.205sec/0.000sec (0.003sec, 0.000sec) (0.001sec/0.006sec/4) 780B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123712_33124_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:30,864 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-392140 - {PUT-O-180948} [synapsis.nadir.org] Request failed: PUT matrix://synapsis.nadir.org/_matrix/federation/v1/send/1674746059517: ResponseNeverReceived:[CancelledError()]
tonkku-matrix-server  | 2023-01-27 07:50:30,939 - synapse.access.http.9100 - 460 - INFO - POST-64492 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:30,940 - synapse.http.client - 460 - INFO - GET-12347 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server  | 2023-01-27 07:50:31,086 - synapse.access.http.8008 - 460 - INFO - GET-64493 - 2a00:1098:84:1c8::158 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 301B 200 "GET /_matrix/key/v2/server HTTP/1.0" "Synapse/1.76.0rc1 (b=matrix-org-hotfixes,484db60cd6)" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:31,149 - synapse.http.matrixfederationclient - 299 - INFO - POST-64488 - {GET-O-141113} [matrix.org] Completed request: 200 OK in 0.38 secs, got 1892 bytes - GET matrix://matrix.org/_matrix/federation/v1/user/devices/%40hannahbarbarian%3Amatrix.org
tonkku-matrix-server  | 2023-01-27 07:50:31,150 - root - 1352 - INFO - POST-64488 - Skipping device list resync for @hannahbarbarian:matrix.org, as our cache matches already
tonkku-matrix-server  | 2023-01-27 07:50:31,151 - synapse.access.http.9100 - 460 - INFO - POST-64488 - ::ffff:192.168.240.5 - 9100 - {None} Processed request: 0.387sec/0.000sec (0.001sec, 0.001sec) (0.001sec/0.002sec/3) 279B 200 "POST /_synapse/replication/fed_send_edu/m.device_list_update/eIFEReVCkH HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:31,152 - synapse.http.client - 460 - INFO - PUT-14201 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.device_list_update/eIFEReVCkH: 200
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:31,312 - synapse.http.federation.matrix_federation_agent - 363 - INFO - federation_transaction_transmission_loop-396260 - Failed to connect to eggy.cc:8448: User timeout caused connection failure.
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:31,313 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-396260 - {PUT-O-182682} [eggy.cc] Request failed: PUT matrix://eggy.cc/_matrix/federation/v1/send/1674746061248: TimeoutError('')
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,967 - synapse.federation.transport.server.federation - 103 - INFO - PUT-12348 - Received txn 1673566981718 from gruenhage.xyz. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:31,970 - synapse.replication.http.federation - 186 - INFO - POST-64494 - Got 'm.presence' edu from gruenhage.xyz
tonkku-matrix-server  | 2023-01-27 07:50:31,971 - synapse.access.http.9100 - 460 - INFO - POST-64494 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/mwocLhmkrK HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,971 - synapse.http.client - 460 - INFO - PUT-12348 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/mwocLhmkrK: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,985 - synapse.access.http.9101 - 460 - INFO - PUT-12348 - 109.250.36.91 - 9101 - {gruenhage.xyz} Processed request: 0.019sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.013sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1673566981718 HTTP/1.0" "Synapse/1.74.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,695 - synapse.federation.transport.server.federation - 103 - INFO - PUT-10819 - Received txn 1674647102137 from strahlungsfrei.de. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:32,698 - synapse.replication.http.federation - 186 - INFO - POST-64495 - Got 'm.presence' edu from strahlungsfrei.de
tonkku-matrix-server  | 2023-01-27 07:50:32,710 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109169 - Streaming: presence -> 20566156
tonkku-matrix-server  | 2023-01-27 07:50:32,713 - synapse.access.http.9100 - 460 - INFO - POST-64495 - ::ffff:192.168.240.6 - 9100 - {None} Processed request: 0.015sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.011sec/3) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/tAVsrPDqdu HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,714 - synapse.http.client - 460 - INFO - PUT-10819 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/tAVsrPDqdu: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:32,716 - synapse.access.http.9101 - 460 - INFO - GET-12347 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 1.779sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.002sec/3) 425B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123713_33124_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,717 - synapse.access.http.9102 - 460 - INFO - PUT-10819 - 37.221.197.227 - 9102 - {strahlungsfrei.de} Processed request: 0.022sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1674647102137 HTTP/1.0" "Synapse/1.75.0" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:32,952 - synapse.access.http.9100 - 460 - INFO - POST-64496 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:32,952 - synapse.http.client - 460 - INFO - GET-12350 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,864 - synapse.federation.transport.server.federation - 103 - INFO - PUT-10820 - Received txn 1674646692320 from roastburry.com. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:33,867 - synapse.replication.http.federation - 186 - INFO - POST-64497 - Got 'm.presence' edu from roastburry.com
tonkku-matrix-server  | 2023-01-27 07:50:33,868 - synapse.access.http.9100 - 460 - INFO - POST-64497 - ::ffff:192.168.240.6 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/LBeYTSWDXP HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,869 - synapse.http.client - 460 - INFO - PUT-10820 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/LBeYTSWDXP: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,875 - synapse.access.http.9102 - 460 - INFO - PUT-10820 - 99.46.147.41 - 9102 - {roastburry.com} Processed request: 0.012sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.006sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1674646692320 HTTP/1.0" "Synapse/1.74.0" [0 dbevts]tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,215 - synapse.replication.http.send_events - 147 - INFO - POST-31712 - Got batch of events to send, last ID of batch is: $1Kq2JWRsQNYq3BdNb_D1vUPeJciH7j39-l7dFyDqhWE, sending into room: !KibRmKHfMtoXfmIqBP:tonkku.me
tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,232 - synapse.access.http.9114 - 460 - INFO - POST-31712 - ::ffff:192.168.240.6 - 9114 - {None} Processed request: 0.017sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 149B 200 "POST /_synapse/replication/send_events/frDXoxQLuc HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,233 - synapse.http.client - 460 - INFO - PUT-10818 - Received response to POST http://generic-worker4:9114/_synapse/replication/send_events/frDXoxQLuc: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,233 - synapse.replication.tcp.client - 372 - INFO - PUT-10818 - Waiting for repl stream 'events' to reach 817572 (generic_worker4)
tonkku-matrix-server-generic-worker-4 | 2023-01-27 07:50:29,237 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-725 - Streaming: events -> 817572
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,239 - synapse.replication.tcp.client - 384 - INFO - PUT-10818 - Finished waiting for repl stream 'events' to reach 817572 (generic_worker4)
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:29,240 - synapse.access.http.9102 - 460 - INFO - PUT-10818 - [redacted IP] - 9102 - {@tonkku:tonkku.me} Processed request: 0.049sec/0.000sec (0.005sec, 0.001sec) (0.003sec/0.013sec/15) 59B 200 "PUT /_matrix/client/r0/rooms/!KibRmKHfMtoXfmIqBP%3Atonkku.me/send/m.room.encrypted/m1674805828912.7 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [3 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,259 - synapse.access.http.9101 - 460 - INFO - GET-12340 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 4.449sec/0.000sec (0.007sec, 0.002sec) (0.024sec/0.045sec/18) 1272B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817571%7E2.817568_20566155_2601_2123712_33123_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [1 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,452 - synapse.access.http.9100 - 460 - INFO - POST-64485 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,453 - synapse.http.client - 460 - INFO - GET-12342 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server  | 2023-01-27 07:50:29,463 - synapse.access.http.8008 - 460 - INFO - POST-64484 - [redacted IP] - 8008 - {@tonkku:tonkku.me} Processed request: 0.012sec/0.001sec (0.000sec, 0.001sec) (0.001sec/0.008sec/4) 2B 200 "POST /_matrix/client/r0/rooms/!KibRmKHfMtoXfmIqBP%3Atonkku.me/read_markers HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,464 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109167 - Streaming: account_data -> 33124
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,471 - synapse.access.http.9101 - 460 - INFO - GET-12342 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 0.021sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 729B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123712_33123_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:29,590 - synapse.access.http.9100 - 460 - INFO - POST-64486 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,591 - synapse.http.client - 460 - INFO - GET-12344 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,892 - synapse.federation.transport.server.federation - 103 - INFO - PUT-12345 - Received txn 1673813262285 from matrix.hadallen.ca. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:29,896 - synapse.replication.http.federation - 186 - INFO - POST-64487 - Got 'm.presence' edu from matrix.hadallen.ca
tonkku-matrix-server  | 2023-01-27 07:50:29,897 - synapse.access.http.9100 - 460 - INFO - POST-64487 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/jtnpiXDrUe HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,897 - synapse.http.client - 460 - INFO - PUT-12345 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/jtnpiXDrUe: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:29,907 - synapse.access.http.9101 - 460 - INFO - PUT-12345 - 24.68.83.130 - 9101 - {matrix.hadallen.ca} Processed request: 0.015sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.009sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1673813262285 HTTP/1.0" "Synapse/1.73.0" [0 dbevts]
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:30,172 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-393797 - {PUT-O-181577} [bancino.net] Request failed: PUT matrix://bancino.net/_matrix/federation/v1/send/1674746060146: ResponseNeverReceived:[CancelledError()]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:30,760 - synapse.federation.transport.server.federation - 103 - INFO - PUT-14201 - Received txn 1674755342297 from matrix.org. (PDUs: 0, EDUs: 4)
tonkku-matrix-server  | 2023-01-27 07:50:30,764 - synapse.replication.http.federation - 186 - INFO - POST-64488 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,765 - synapse.replication.http.federation - 186 - INFO - POST-64489 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,766 - synapse.replication.http.federation - 186 - INFO - POST-64490 - Got 'm.device_list_update' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,767 - synapse.replication.http.federation - 186 - INFO - POST-64491 - Got 'm.receipt' edu from matrix.org
tonkku-matrix-server  | 2023-01-27 07:50:30,769 - synapse.handlers.device - 1111 - INFO - POST-64488 - Received device list update for @hannahbarbarian:matrix.org, requiring resync: True. Devices: HMJZNAGOZS
tonkku-matrix-server  | 2023-01-27 07:50:30,770 - synapse.handlers.device - 1111 - INFO - POST-64489 - Received device list update for @lr7:matrix.org, requiring resync: True. Devices: WVPPLBOOFW
tonkku-matrix-server  | 2023-01-27 07:50:30,783 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109168 - Streaming: receipts -> 2123713
tonkku-matrix-server  | 2023-01-27 07:50:30,788 - synapse.access.http.9100 - 460 - INFO - POST-64491 - ::ffff:192.168.240.5 - 9100 - {None} Processed request: 0.021sec/0.000sec (0.001sec, 0.001sec) (0.002sec/0.015sec/5) 279B 200 "POST /_synapse/replication/fed_send_edu/m.receipt/TtOesxZXle HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:30,789 - synapse.http.client - 460 - INFO - PUT-14201 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.receipt/TtOesxZXle: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:30,794 - synapse.access.http.9101 - 460 - INFO - GET-12344 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 1.205sec/0.000sec (0.003sec, 0.000sec) (0.001sec/0.006sec/4) 780B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123712_33124_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:30,864 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-392140 - {PUT-O-180948} [synapsis.nadir.org] Request failed: PUT matrix://synapsis.nadir.org/_matrix/federation/v1/send/1674746059517: ResponseNeverReceived:[CancelledError()]
tonkku-matrix-server  | 2023-01-27 07:50:30,939 - synapse.access.http.9100 - 460 - INFO - POST-64492 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:30,940 - synapse.http.client - 460 - INFO - GET-12347 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server  | 2023-01-27 07:50:31,086 - synapse.access.http.8008 - 460 - INFO - GET-64493 - 2a00:1098:84:1c8::158 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 301B 200 "GET /_matrix/key/v2/server HTTP/1.0" "Synapse/1.76.0rc1 (b=matrix-org-hotfixes,484db60cd6)" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:31,149 - synapse.http.matrixfederationclient - 299 - INFO - POST-64488 - {GET-O-141113} [matrix.org] Completed request: 200 OK in 0.38 secs, got 1892 bytes - GET matrix://matrix.org/_matrix/federation/v1/user/devices/%40hannahbarbarian%3Amatrix.org
tonkku-matrix-server  | 2023-01-27 07:50:31,150 - root - 1352 - INFO - POST-64488 - Skipping device list resync for @hannahbarbarian:matrix.org, as our cache matches already
tonkku-matrix-server  | 2023-01-27 07:50:31,151 - synapse.access.http.9100 - 460 - INFO - POST-64488 - ::ffff:192.168.240.5 - 9100 - {None} Processed request: 0.387sec/0.000sec (0.001sec, 0.001sec) (0.001sec/0.002sec/3) 279B 200 "POST /_synapse/replication/fed_send_edu/m.device_list_update/eIFEReVCkH HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-3 | 2023-01-27 07:50:31,152 - synapse.http.client - 460 - INFO - PUT-14201 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.device_list_update/eIFEReVCkH: 200
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:31,312 - synapse.http.federation.matrix_federation_agent - 363 - INFO - federation_transaction_transmission_loop-396260 - Failed to connect to eggy.cc:8448: User timeout caused connection failure.
tonkku-matrix-server-federation-sender-1 | 2023-01-27 07:50:31,313 - synapse.http.matrixfederationclient - 672 - INFO - federation_transaction_transmission_loop-396260 - {PUT-O-182682} [eggy.cc] Request failed: PUT matrix://eggy.cc/_matrix/federation/v1/send/1674746061248: TimeoutError('')
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,967 - synapse.federation.transport.server.federation - 103 - INFO - PUT-12348 - Received txn 1673566981718 from gruenhage.xyz. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:31,970 - synapse.replication.http.federation - 186 - INFO - POST-64494 - Got 'm.presence' edu from gruenhage.xyz
tonkku-matrix-server  | 2023-01-27 07:50:31,971 - synapse.access.http.9100 - 460 - INFO - POST-64494 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/mwocLhmkrK HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,971 - synapse.http.client - 460 - INFO - PUT-12348 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/mwocLhmkrK: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:31,985 - synapse.access.http.9101 - 460 - INFO - PUT-12348 - 109.250.36.91 - 9101 - {gruenhage.xyz} Processed request: 0.019sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.013sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1673566981718 HTTP/1.0" "Synapse/1.74.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,695 - synapse.federation.transport.server.federation - 103 - INFO - PUT-10819 - Received txn 1674647102137 from strahlungsfrei.de. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:32,698 - synapse.replication.http.federation - 186 - INFO - POST-64495 - Got 'm.presence' edu from strahlungsfrei.de
tonkku-matrix-server  | 2023-01-27 07:50:32,710 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-109169 - Streaming: presence -> 20566156
tonkku-matrix-server  | 2023-01-27 07:50:32,713 - synapse.access.http.9100 - 460 - INFO - POST-64495 - ::ffff:192.168.240.6 - 9100 - {None} Processed request: 0.015sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.011sec/3) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/tAVsrPDqdu HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,714 - synapse.http.client - 460 - INFO - PUT-10819 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/tAVsrPDqdu: 200
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:32,716 - synapse.access.http.9101 - 460 - INFO - GET-12347 - [redacted IP] - 9101 - {@tonkku:tonkku.me} Processed request: 1.779sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.002sec/3) 425B 200 "GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=m788039%7E3.817456%7E1.817572%7E2.817568_20566155_2601_2123713_33124_30_7014_2345778_0_3 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:32,717 - synapse.access.http.9102 - 460 - INFO - PUT-10819 - 37.221.197.227 - 9102 - {strahlungsfrei.de} Processed request: 0.022sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1674647102137 HTTP/1.0" "Synapse/1.75.0" [0 dbevts]
tonkku-matrix-server  | 2023-01-27 07:50:32,952 - synapse.access.http.9100 - 460 - INFO - POST-64496 - ::ffff:192.168.240.7 - 9100 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-1 | 2023-01-27 07:50:32,952 - synapse.http.client - 460 - INFO - GET-12350 - Received response to POST http://synapse:9100/_synapse/replication/presence_set_state/%40tonkku%3Atonkku.me: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,864 - synapse.federation.transport.server.federation - 103 - INFO - PUT-10820 - Received txn 1674646692320 from roastburry.com. (PDUs: 0, EDUs: 1)
tonkku-matrix-server  | 2023-01-27 07:50:33,867 - synapse.replication.http.federation - 186 - INFO - POST-64497 - Got 'm.presence' edu from roastburry.com
tonkku-matrix-server  | 2023-01-27 07:50:33,868 - synapse.access.http.9100 - 460 - INFO - POST-64497 - ::ffff:192.168.240.6 - 9100 - {None} Processed request: 0.001sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 279B 200 "POST /_synapse/replication/fed_send_edu/m.presence/LBeYTSWDXP HTTP/1.1" "Synapse/1.76.0rc1" [0 dbevts]
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,869 - synapse.http.client - 460 - INFO - PUT-10820 - Received response to POST http://synapse:9100/_synapse/replication/fed_send_edu/m.presence/LBeYTSWDXP: 200
tonkku-matrix-server-generic-worker-2 | 2023-01-27 07:50:33,875 - synapse.access.http.9102 - 460 - INFO - PUT-10820 - 99.46.147.41 - 9102 - {roastburry.com} Processed request: 0.012sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.006sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1674646692320 HTTP/1.0" "Synapse/1.74.0" [0 dbevts]

Anything else that would be useful to know?

The logs are a snippet of a few seconds starting from sending a message to a room with members from matrix.org and kde.org. You can see that an outgoing federation request is never even attempted to either of those servers

@DMRobertson
Copy link
Contributor

My federation sender worker hasn't been federating outgoing messages since updating to 1.76.0rc1.

Outgoing messages in

  • all rooms?
  • some rooms?
  • only rooms that you faster-joined?

The logs are a snippet of a few seconds starting from sending a message to a room with members from matrix.org and kde.org.

I think the sample there doesn't have enough context to diagnose. Are you able to share the logs privately with me? If so, please contact me at @dmrobertson:matrix.org or [email protected]. (Email is probably safer if your HS is struggling to send outbound traffic.)

If not:

  • Are your logs' timestamps in UTC, or some other timezone?
  • Can you share the room ID of the room mentioned so I can cross-reference the logs on matrix.org?
  • Did you join that room after the upgrade to 1.76.0rc1?
  • Are there any error log lines since your upgrade to 1.76.0rc1?
    • It's also worth grepping the logs for the room ID and for "partial_state". But it is probably easier for me to do this rather than do this asynchronously via GitHub.

@DMRobertson DMRobertson added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jan 27, 2023
@tonkku107
Copy link
Author

This is for all rooms, even those that I was in before updating. The timestamps are in UTC.

The room id for the snippet is !KibRmKHfMtoXfmIqBP:tonkku.me. (I chose that because it's small and easy to read the logs of)
The room was created by me and I was already in it before upgrading. I've also tried to send messages in synapse admins.
For reference, I upgraded at 2023-01-25 20:55:54 UTC

I couldn't find any unusual errors myself. I'll email you the full logs since upgrading once I can get them to an emailable size.

@DMRobertson
Copy link
Contributor

Many thanks for providing logs.

I took the earliest log file in the room you mentioned as an example.

2023-01-25 21:10:44,457 - synapse.access.http.9102 - 460 - INFO - PUT-2089 - 84.231.192.133 - 9102 - {@tonkku:tonkku.me} Processed request: 0.056sec/0.000sec (0.003sec, 0.│
001sec) (0.002sec/0.011sec/11) 59B 200 "PUT /_matrix/client/r0/rooms/!KibRmKHfMtoXfmIqBP%3Atonkku.me/send/m.room.encrypted/m1674681044230.8 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, li│
ke Gecko) Element/1.11.20 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36" [2 dbevts]  

I couldn't see any error messages associated with PUT-2089 on that worker, so I assume that the event was persisted correctly on your homeserver.

I cross-referenced with Matrix.org logs, but I could find no federation transactions sent from tonkku.me with a PDU in them (i.e. messages aren't getting through, but presence and typing are). The logs from your federation sender show the same thing.

I'm not sure why this is happening; I suspect something might have regressed involving the way that workers communicate. Can you rollback to 1.75.0, and let us know if your messages are sent across federation? It would also be interesting to see if the problem remains when using a single-worker deployment on 1.76.0rc1 too.

@DMRobertson
Copy link
Contributor

It would also be useful to know if disabling faster joins on 1.76.0rc1 fixes the problem, by setting

experimental:
    faster_joins: false

in config.

@tonkku107
Copy link
Author

tonkku107 commented Jan 27, 2023

Disabling faster joins had no effect.
After downgrading to 1.75.0 synapse has started sending all the messages that weren't sent earlier. However, Element is failing to sync with "Invalid stream token". Clearing element's cache solved that issue.

@tonkku107
Copy link
Author

And I've just confirmed that 1.76.0rc1 with no workers federates correctly.

@squahtx
Copy link
Contributor

squahtx commented Jan 27, 2023

Could you test again on 1.76.0rc1 with workers? Perhaps restarting Synapse is what fixed the issue.

@tonkku107
Copy link
Author

tonkku107 commented Jan 27, 2023

I did do a restart yesterday. @DMRobertson can probably confirm by looking at the logs.
edit: I also restarted when I tested disabling faster joins...

@squahtx squahtx added A-Federated-Join joins over federation generally suck S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Jan 27, 2023
@tonkku107
Copy link
Author

This wasn't an error when joining. I would suggest the A-Workers label

@squahtx squahtx added the A-Workers Problems related to running Synapse in Worker Mode (or replication) label Jan 27, 2023
@squahtx
Copy link
Contributor

squahtx commented Jan 27, 2023

Good call. The problem sounds like it's caused by one of the faster joins changes, so I'll leave the A-Federated-Join label on for now.

@squahtx
Copy link
Contributor

squahtx commented Jan 27, 2023

It would also be useful to know if disabling faster joins on 1.76.0rc1 fixes the problem, by setting

experimental:
    faster_joins: false

in config.

NB: this only prevents further faster joins from taking place. All the faster joins machinery will still apply for existing partial state rooms.

I wonder if there is a stuck partial state room on your server. Could you dump the partial_state_rooms and partial_state_rooms_servers tables in postgres?

SELECT * FROM partial_state_rooms_servers;
SELECT room_id, COUNT(*) FROM partial_state_rooms_servers GROUP BY room_id;

@tonkku107
Copy link
Author

Considering that even messages that I sent before doing my first faster join weren't sent, I doubt that's the issue.

synapse=# SELECT * FROM partial_state_rooms_servers;
 room_id | server_name 
---------+-------------
(0 rows)

synapse=# SELECT room_id, COUNT(*) FROM partial_state_rooms_servers GROUP BY room_id;
 room_id | count 
---------+-------
(0 rows)

@squahtx squahtx removed the A-Federated-Join joins over federation generally suck label Jan 27, 2023
@tonkku107
Copy link
Author

During that time I was trying to ask the synapse admins room about the space I joined having no results from the hierarchy api request. Now that I've downgraded, everything works on that end too. I don't know if it's related or not and it might need a separate issue but I thought I'd mention this now.

@cremesk
Copy link
Contributor

cremesk commented Jan 30, 2023

@tonkku107 maybe you need to update your federation worker config. since you use:

federation_sender_instances:
  - federation_sender1

in your shared config.

it was a change in synapse version 1.74.0. see: #14493

so you need to remove send_federation: false in the shared config and change the worker config to:

worker_app: synapse.app.generic_worker
worker_name: federation_sender1

worker_replication_host: synapse
worker_replication_http_port: 9100

worker_log_config: "/data/tonkku.me.log.config"

maybe it fixed your problem.

@tonkku107
Copy link
Author

That did work, but then I was also unable to reproduce the problem again, so I don't know if that's the actual solution.
Maybe there was something up with the database that then got resolved later by the downgrade? I did take a dump of the database before upgrading but would prefer not to roll back right now just to test. Could've also been redis that was acting up?

@tonkku107
Copy link
Author

Oh, nope. A simple restart later and the same happened (with the updated federation sender configuration and rc2).
Tried restarting a couple times with no luck, then rolled back to 1.75.0 and the messages were sent.
Switched back to 1.76.0rc2 and federation seems to work again for now.

@H-Shay
Copy link
Contributor

H-Shay commented Jan 31, 2023

Switched back to 1.76.0rc2 and federation seems to work again for now.

To be clear, this sounds like the issue is resolved for now? If so please let me know and we can close the issue.

@tonkku107
Copy link
Author

It is working right now but if I go and restart, there's a chance it gets stuck again and I need to downgrade for a while before upgrading again

@H-Shay
Copy link
Contributor

H-Shay commented Feb 2, 2023

Okay, I am going to close this issue for now, if it re-occurs please feel free to come back and we can re-open.

@H-Shay H-Shay closed this as completed Feb 2, 2023
@tonkku107
Copy link
Author

The issue has re-occurred after updating to 1.79.0. It seems I do not have permission to re-open the issue.

@H-Shay
Copy link
Contributor

H-Shay commented Mar 15, 2023

I can re-open, can you describe what has happened after updating to 1.79.0 + provide any relevant logs?

@H-Shay H-Shay reopened this Mar 15, 2023
@tonkku107
Copy link
Author

Just to be clear, it probably doesn't have anything to do with this version specifically, I've just been avoiding restarting except for updates.

Summary for the issue so far is that EDUs do get through federation but PDUs are never sent. Rolling back to 1.75.0, letting the pending messages get sent and updating again temporarily fixed the problem, but a restart could trigger it again. Now version 1.75.0 is probably too far to downgrade to safely...

It's hard to get a relevant log snippet to show what's going on because something is simply just not happening. Last time I sent @DMRobertson the complete set of logs via email.

@tonkku107
Copy link
Author

About 16h later I got notified as a message did happen to go through. I identified that the federation sender picked back up at 13:44 UTC.
The only thing I could really gather from the logs are a few

2023-03-16 13:44:28,703 - synapse.federation.sender - 473 - INFO - process_event_queue_for_federation-909 - Unexpectedly did not have cached prev group for $-ddGflbghaBUA_3B_xfvldW3OEkR59cM7gM6OFVbdts

lines with different event ids. Those IDs match with the messages that were pending to be sent.
I have saved a copy of the logs now and could send them via email again if you want to take a closer look and see if it helps figure out the cause and why now of all times synapse starts sending them again.

@tonkku107
Copy link
Author

Happened again with the update to 1.80.0

@reivilibre
Copy link
Contributor

Would you mind checking if there are any partial state rooms at all, possibly that don't have entries in partial_state_rooms_servers?

SELECT * FROM partial_state_rooms;

Another thing I'd suggest trying would be to reduce to one event stream writer and see if that alleviates the problem:

stream_writers:
  events:
    - generic_worker1

I'd normally suggest trying with experimental features disabled, but I don't see how those you've enabled would really affect anything.

Is the issue with all rooms or just one?

@tonkku107
Copy link
Author

Would you mind checking if there are any partial state rooms at all, possibly that don't have entries in partial_state_rooms_servers?

SELECT * FROM partial_state_rooms;

nothing in partial_state_rooms or partial_state_rooms_servers right now.

Another thing I'd suggest trying would be to reduce to one event stream writer and see if that alleviates the problem:

stream_writers:
  events:
    - generic_worker1

I commented out the other writers in my config before updating to 1.81 and this time it didn't get "stuck". Will have to keep observing tho.

Is the issue with all rooms or just one?

Already answered at the beginning that it's for all rooms.

@H-Shay
Copy link
Contributor

H-Shay commented May 30, 2023

Is this still an issue?

@tonkku107
Copy link
Author

It seems switching to one stream writer has alleviated the issue. I think we could rename the issue to indicate that it's a problem when using multiple stream writers

@H-Shay H-Shay changed the title Federation sender worker not federating messages in 1.76.0rc1 Federation sender worker not federating messages when using multiple stream writers May 30, 2023
@H-Shay H-Shay removed the X-Needs-Info This issue is blocked awaiting information from the reporter label May 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

6 participants