
[BUG] Substantial performance degradation on TCP transport after transport refactor #61656

Open
lukasraska opened this issue Feb 16, 2022 · 2 comments

@lukasraska
Contributor

Description
The big transport refactor (#61450) that was meant to decouple general logic from the transports and ease the implementation of new transport methods brought an unwanted performance degradation to the TCP transport.

In particular, the degradation can be seen when a new job is published to specific minions (topic_lst) instead of being broadcast. The previous logic for publishing a job was pretty straightforward (https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1517; a simplified sketch follows the list):

  1. Check if topic_lst was passed (this contains the list of all minions the job is meant for)
  2. Loop through all topics (all targeted minions)
  3. Look up the connected clients for that minion id (there may be multiple connections for a single minion)
  4. Write the job to each of those sockets
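
A minimal sketch of that pre-refactor shape (illustrative names, not the actual Salt code; `present` here maps a minion id to the connections currently open for that id, so targeting is a dict lookup per topic rather than a scan over every connected client):

```python
async def publish_to_targets(package, topic_lst, present):
    for topic in topic_lst:
        # O(1) lookup of the connections belonging to this minion id
        for client in present.get(topic, ()):
            await client.stream.write(package)
```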

After the refactor, the logic is mostly inverted (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L910; sketch below):

  1. Check if topic_lst was passed
  2. Loop through all topics
  3. Loop through all clients
  4. Check if the client matches the topic and, if so, write the job to the socket
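
A minimal sketch of the post-refactor shape (again with illustrative names): every topic is checked against every connected client, so the cost grows with len(topic_lst) * len(clients).

```python
async def publish_to_targets(package, topic_lst, clients):
    for topic in topic_lst:
        # scans every connected client for every targeted minion
        for client in clients:
            if client.id_ == topic:
                await client.stream.write(package)
```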

In practice this means that with 10k connected minions and a job published to just a single minion, the old code was effectively immediate (one iteration of the topic_lst loop plus a dict lookup), while after the refactor the worst case is 10k iterations of the clients loop. Publishing a job to multiple minions only makes this worse: for a job targeting 5k minions on a 10k-minion setup, the previous cost was roughly proportional to the number of minions in the job, while it is now the number of minions in the job times the number of connected clients.

The source of the issue stems from how the clients are looked up and how the authentication mechanism works in Salt (a sketch of the decoupled presence tracking follows the list):

  1. When a minion connects, it is added to the self.clients set (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L905, https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1512)
  2. The stream is read and the data is processed (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L885, https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1482)
  3. The minion is authenticated and added to the present dict => while in the previous version this dict lived directly in the transport internals, it is now decoupled and maintained by the server channel's presence_callback (https://github.com/saltstack/salt/blob/master/salt/channel/server.py#L701)
  4. When a new job is published, the previous logic looked targets up in the presence dictionary (a dict lookup), whereas that dictionary is no longer available to the TCP transport classes
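
A minimal sketch of the decoupled presence tracking (names are illustrative, not the exact Salt API): the server channel, not the transport, owns the present mapping and updates it from callbacks fired when a minion authenticates or disconnects, so the TCP transport never sees it.

```python
present = {}

def presence_callback(client, load):
    # invoked by the server channel once the minion's payload is decoded
    present.setdefault(load["id"], set()).add(client)

def remove_presence_callback(client):
    # invoked when the connection is dropped
    for clients in present.values():
        clients.discard(client)
```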

This also brings an additional DDoS-type issue: one can write a trivial TCP client that just connects to the master without any extra steps (simply keeping the connection hanging). Such a client is still accounted for in the publish loop, because the loop iterates over all connected clients instead of relying on the presence dictionary.
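
A proof-of-concept of the hanging-connection concern (hostname and connection count are made up; 4505 is the TCP transport's default publish port): a bare socket that connects and never authenticates still lands in self.clients and is visited on every publish.

```python
import socket
import time

# open many connections that never speak the Salt protocol
socks = [
    socket.create_connection(("salt-master.example.com", 4505))
    for _ in range(1000)
]
time.sleep(3600)  # keep the unauthenticated connections open
```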

Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)
Any new Salt Master using TCP transport.

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD

Steps to Reproduce the behavior
See above

Expected behavior
Publishing a single job on a master with 10k minions doesn't take dozens of seconds.

Screenshots
n/a

Versions Report
n/a, latest master branch

Additional context
n/a

@lukasraska lukasraska added Bug broken, incorrect, or confusing behavior needs-triage labels Feb 16, 2022
@lukasraska
Contributor Author

I'm willing to write a PR to address this, as discussed on Slack with @waynew, but I would definitely need some direction from the Salt engineering team (FYI @dwoz), as none of the options I can think of is ideal.

The easiest option would be to change topic_lst to a set, invert the loop to iterate over clients instead, and perform a set lookup on the minion id (sketched below). While this roughly halves the total time in my tests, it is still far from ideal.
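
A rough sketch of that first option (illustrative names): build a set from topic_lst and invert the loops, so each connected client costs a single set membership test instead of one comparison per topic.

```python
async def publish_to_targets(package, topic_lst, clients):
    topics = set(topic_lst)
    for client in clients:
        # one O(1) membership test per connected client
        if client.id_ in topics:
            await client.stream.write(package)
```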

The second option would be to pass the present dict from the server channel to the TCP transport's publish_payload and retain the previous logic (sketched below). But that requires the dict to be passed for all transports.
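
A rough sketch of the second option (the extra parameter and call site are hypothetical, not an existing Salt signature): the server channel hands its presence mapping down on every publish, which restores the v3004-style dict lookup but widens the interface that every transport has to honour.

```python
# hypothetical call from the server channel:
#   await self.transport.publish_payload(payload, topic_lst, present=self.present)

async def publish_payload(payload, topic_lst=None, present=None):
    if topic_lst and present is not None:
        for topic in topic_lst:
            # dict lookup of the connections for this minion id
            for client in present.get(topic, ()):
                await client.stream.write(payload)
```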

The third option would be to wrap the presence callback and essentially maintain a copy of the present dict inside the TCP transport (sketched below). While this is independent of the other transports, it adds additional memory load to the master process.
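
A rough sketch of the third option (class and method names are hypothetical): the transport wraps the callbacks the server channel gives it and keeps its own id -> connections mirror, trading extra memory in the master process for transport-local dict lookups at publish time.

```python
class PresenceMirror:
    def __init__(self, presence_callback, remove_presence_callback):
        self.present = {}
        self._on_present = presence_callback
        self._on_remove = remove_presence_callback

    def client_present(self, client, load):
        # record the connection locally, then forward to the server channel
        self.present.setdefault(load["id"], set()).add(client)
        self._on_present(client, load)

    def client_removed(self, client):
        for clients in self.present.values():
            clients.discard(client)
        self._on_remove(client)
```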

What do you think?

@dwoz dwoz removed the needs-triage label Feb 24, 2022
@dwoz dwoz self-assigned this Feb 24, 2022
@dwoz dwoz added the Phosphorus v3005.0 Release code name and version label Feb 24, 2022
@dwoz
Contributor

dwoz commented Feb 25, 2022

@lukasraska Here is a lightly tested patch. It should get you pointed in the right direction.

0001-TCP-transport-optimization.patch.txt
