Description
The big transport refactor (#61450), which was meant to decouple some general logic from the transport and ease the implementation of new transport methods, brought an unwanted performance degradation to the TCP transport.

In particular, the degradation shows up when a new job is published to specific minions (target_lst) instead of broadcast. The previous publish logic was pretty straightforward (https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1517): topic_lst was passed (this contains the list of all minions the job is meant for), and for each topic the matching client was looked up in a dict and the job written to its socket. After the refactor the logic is mostly the opposite (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L910): topic_lst is still passed, but the loop goes over all connected clients and, for each client, checks if the client matches the topic and if so, writes the job to the socket.

In practice this means that if there are 10k connected minions and a new job should be published to just one single minion, the old path is immediate (one iteration through the topic_lst loop plus a dict lookup), while after the refactor the worst case is 10k iterations of the clients loop. Publishing a job to multiple minions only makes this worse: for 5k minions in a job on a 10k-minion setup, the previous complexity was roughly proportional to the number of minions in the job, while now it is the number of minions in the job times the number of connected clients.

The source of the issue stems from how the clients are looked up and how the authentication mechanism works in Salt:
Connected clients are kept in the self.clients set (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L905, https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1512).
Minion presence is tracked in the present dict; in the previous version this dict lived directly in the transport internals, while now it is decoupled and part of the server channel.
Presence updates are delivered via presence_callback (https://github.com/saltstack/salt/blob/master/salt/channel/server.py#L701).
When a new job is published, the previous logic looked into the presence dictionary, performing a dict lookup, while now that dictionary is not available to the TCP transport classes.

This also brings an additional DDoS-type issue: one can write a simple TCP client that just connects to the master without any extra steps (just hanging the connection), and that client would be accounted for in the publish loop, instead of the loop relying on the presence dictionary.
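To make the complexity difference concrete, here is a minimal sketch of the two publish paths. This is not the actual Salt code: present, clients, send() and id_ are simplified stand-ins, and error handling is omitted.

```python
# Minimal sketch of the two publish paths; not the actual Salt code.
# `present`, `clients`, `send()` and `id_` are simplified stand-ins.

def publish_pre_refactor(payload, topic_lst, present):
    # O(len(topic_lst)): one dict lookup per targeted minion.
    for topic in topic_lst:
        client = present.get(topic)
        if client is not None:
            client.send(payload)


def publish_post_refactor(payload, topic_lst, clients):
    # O(len(clients) * len(topic_lst)): every connected client is visited
    # for every publish, and the topic check is a linear scan over a list.
    for client in clients:
        for topic in topic_lst:
            if client.id_ == topic:
                client.send(payload)
```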
Setup
Any new Salt Master using TCP transport.
Steps to Reproduce the behavior
See above
Expected behavior
Publishing a single job on a master with 10k minions does not take dozens of seconds.
Screenshots
n/a
Versions Report
n/a, latest master branch
Additional context
n/a
I'm willing to write a PR to address this, as discussed on Slack with @waynew, but I would definitely need some direction from the Salt engineering team (FYI @dwoz), as none of the options I can think of are ideal.
The easiest option would be to change topic_lst to a set and change the loop logic to iterate over clients instead, performing a set lookup on the id. While this roughly halved the total time in my tests, it is still far from ideal.
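A rough sketch of what this first option could look like (hypothetical names, only meant to illustrate the set lookup, not a proposed patch):

```python
# Option 1 sketch: make topic_lst a set so the per-client check is O(1).
# `clients`, `client.id_` and `send()` are simplified stand-ins.

def publish_payload_option1(payload, topic_lst, clients):
    topics = set(topic_lst)   # O(1) membership test instead of a list scan
    for client in clients:    # still iterates over every connected client
        if client.id_ in topics:
            client.send(payload)
```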
The second option would be to pass the present dict from the server channel to publish_payload in the TCP transport and retain the previous logic. But that requires it to be passed for all transports.
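A sketch of the second option, showing only where the presence dict would flow; the publish_payload signature and the present parameter here are assumptions for illustration, not the real API:

```python
# Option 2 sketch: the server channel hands its presence dict down to the
# transport so the old per-topic dict lookup can be restored.
# The `present` parameter and the commented call below are hypothetical.

# server channel side (conceptually):
#     self.transport.publish_payload(payload, topic_lst, present=self.present)

# TCP transport side:
def publish_payload_option2(payload, topic_lst, present):
    for topic in topic_lst:
        client = present.get(topic)  # direct dict lookup, as before the refactor
        if client is not None:
            client.send(payload)
```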
The third option would be to wrap the presence callback and essentially maintain a copy of the present dict in the TCP transport. While this is independent of other transports, it adds memory overhead to the master process.
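And a sketch of the third option, wrapping the presence callback to keep a shadow copy of the present dict inside the TCP transport; the callback signature here is an assumption for illustration:

```python
# Option 3 sketch: maintain a local copy of the presence dict in the TCP
# transport by wrapping the callback the server channel registers.
# The callback signature (minion_id, client, connected) is hypothetical.

class PresenceTrackingMixin:
    def __init__(self):
        self._present = {}  # shadow copy; costs extra memory on the master

    def wrap_presence_callback(self, original_callback):
        def wrapped(minion_id, client, connected):
            if connected:
                self._present[minion_id] = client
            else:
                self._present.pop(minion_id, None)
            return original_callback(minion_id, client, connected)
        return wrapped
```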