Federation sender worker not federating messages when using multiple stream writers #14928
Outgoing messages in
I think the sample there doesn't have enough context to diagnose. Are you able to share the logs privately with me? If so, please contact me. If not:
This is happening in all rooms, even those that I was in before updating. The timestamps are in UTC. I couldn't find any unusual errors myself in the room from the snippet. I'll email you the full logs since upgrading once I can get them down to an emailable size.
Many thanks for providing logs. As an example, I took the earliest log file for the room you mentioned.

I couldn't see any error messages associated with PUT-2089 on that worker, so I assume that the event was persisted correctly on your homeserver. I cross-referenced with the matrix.org logs, but I could find no federation transactions sent from tonkku.me. I'm not sure why this is happening; I suspect something might have regressed in the way that workers communicate. Could you roll back to 1.75.0 and let us know if your messages are sent across federation? It would also be interesting to see whether the problem remains when using a single-worker deployment on 1.76.0rc1.
It would also be useful to know whether disabling faster joins on 1.76.0rc1 fixes the problem, by setting the relevant flag in your config.
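The exact setting was lost from this copy of the thread. Assuming the flag in question is `faster_joins` under `experimental_features` (an assumption, not confirmed by the text above), the config would look like:

```yaml
# Hedged sketch: assumes the option being discussed is experimental_features.faster_joins
experimental_features:
  faster_joins: false
```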
Disabling faster joins had no effect.

And I've just confirmed that 1.76.0rc1 with no workers federates correctly.

Could you test again on 1.76.0rc1 with workers? Perhaps restarting Synapse is what fixed the issue.

I did do a restart yesterday. @DMRobertson can probably confirm by looking at the logs.
This wasn't an error when joining, so I would suggest a different label.
Good call. The problem sounds like it's caused by one of the faster joins changes, so I'll leave the A-Federated-Join label on for now.
NB: this only prevents further faster joins from taking place. All the faster joins machinery will still apply for existing partial state rooms. I wonder if there is a stuck partial state room on your server. Could you dump the contents of the relevant table?

    SELECT * FROM partial_state_rooms_servers;
    SELECT room_id, COUNT(*) FROM partial_state_rooms_servers GROUP BY room_id;
Considering that even messages that I sent before doing my first faster join weren't sent, I doubt that's the issue.
During that time I was trying to ask the Synapse admins room about the space I joined returning no results from the hierarchy API request. Now that I've downgraded, everything works on that end too. I don't know whether it's related, and it might need a separate issue, but I thought I'd mention it now.
@tonkku107, maybe you need to update your federation sender worker config. The option you use in your shared config was changed in Synapse 1.74.0 (see #14493), so you need to remove the old setting. Maybe that fixes your problem.
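The specific options were elided above. Assuming the comment refers to the 1.74 worker-configuration change, a sketch of the newer style is to list the sender under `federation_sender_instances` in the shared config and remove the deprecated option (the worker name here is illustrative):

```yaml
# Hedged sketch of the post-1.74 style; "federation_sender1" is an illustrative name.
federation_sender_instances:
  - federation_sender1

# The deprecated option, if present, should be removed:
# send_federation: false
```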
That did work, but then I was also unable to reproduce the problem again, so I don't know if that's the actual solution.

Oh, nope. A simple restart later and the same thing happened (with the updated federation sender configuration and rc2).
To be clear, this sounds like the issue is resolved for now? If so, please let me know and we can close the issue.

It is working right now, but if I go and restart, there's a chance it gets stuck again and I need to downgrade for a while before upgrading again.

Okay, I am going to close this issue for now. If it re-occurs, please feel free to come back and we can re-open.
The issue has re-occurred after updating to 1.79.0. It seems I do not have permission to re-open the issue.

I can re-open. Can you describe what has happened after updating to 1.79.0 and provide any relevant logs?
Just to be clear, it probably doesn't have anything to do with this version specifically; I've just been avoiding restarts except for updates. The summary of the issue so far is that EDUs do get through federation, but PDUs are never sent. Rolling back to 1.75.0, letting the pending messages get sent, and updating again temporarily fixed the problem, but a restart could trigger it again. Now 1.75.0 is probably too old to downgrade to safely. It's hard to get a relevant log snippet to show what's going on, because the problem is that something simply isn't happening. Last time I sent @DMRobertson the complete set of logs via email.
About 16 hours later I got a notification as a message did happen to go through. I identified that the federation sender picked back up at 13:44 UTC. The logs at that point show a run of lines with different event IDs; those IDs match the messages that were pending to be sent.
Happened again with the update to 1.80.0.
Would you mind checking if there are any partial state rooms at all, possibly ones that don't have entries in `partial_state_rooms_servers`?

Another thing I'd suggest trying would be to reduce to one event stream writer and see if that alleviates the problem:

    stream_writers:
      events:
        - generic_worker1

I'd normally suggest trying with experimental features disabled, but I don't see how the ones you've enabled would really affect anything. Is the issue with all rooms or just one?
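For contrast, the multi-writer shape that seems to trigger the problem would look something like the sketch below. The worker names are illustrative; per the Synapse worker documentation, each listed writer also needs an HTTP replication listener and an entry in `instance_map`.

```yaml
# Hedged sketch of a multi-writer events configuration; names are illustrative.
stream_writers:
  events:
    - generic_worker1
    - generic_worker2
    - generic_worker3
```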
Nothing in `partial_state_rooms_servers`.
I commented out the other writers in my config before updating to 1.81, and this time it didn't get stuck. Will have to keep observing, though.

Already answered at the beginning that it happens in all rooms.
Is this still an issue?
It seems switching to one stream writer has alleviated the issue. I think we could rename the issue to indicate that it's a problem when using multiple stream writers.
Description
My federation sender worker hasn't been federating outgoing messages since updating to 1.76.0rc1. Even messages sent before doing the first faster join haven't gone through. Presence and typing notifications do seem to go through. Incoming messages are being received normally.
This only happens when multiple event stream writers are configured.
Steps to reproduce
Homeserver
tonkku.me
Synapse Version
1.76.0rc1
Installation Method
Docker (matrixdotorg/synapse)
Database
Single PostgreSQL database
Workers
Multiple workers
Platform
Running in Docker containers on a server with Ubuntu 20.04.5, an AMD Ryzen 5 3600, and 64GB RAM.
Configuration
Using a worker setup with one main Synapse instance, four generic workers, and one federation sender. Presence is enabled.

Experimental features enabled:

Worker configuration:
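The actual worker configuration was not preserved in this copy of the report. A hypothetical sketch consistent with the description above (all names and ports are illustrative, not taken from the original) might look like:

```yaml
# Hypothetical reconstruction; the real config was elided from this report.
# Worker names and ports are illustrative only.
stream_writers:
  events:
    - generic_worker1
    - generic_worker2

instance_map:
  generic_worker1:
    host: localhost
    port: 9101
  generic_worker2:
    host: localhost
    port: 9102

federation_sender_instances:
  - federation_sender1
```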
Relevant log output
Anything else that would be useful to know?
The logs are a snippet of a few seconds starting from sending a message to a room with members from matrix.org and kde.org. You can see that an outgoing federation request is never even attempted to either of those servers.