Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] x509.certificate_managed has DoS effect on salt master. #59169

Closed
ichilton opened this issue Dec 18, 2020 · 4 comments · Fixed by #63099
Closed

[BUG] x509.certificate_managed has DoS effect on salt master. #59169

ichilton opened this issue Dec 18, 2020 · 4 comments · Fixed by #63099
Assignees
Labels
Bug broken, incorrect, or confusing behavior

Comments

@ichilton
Copy link

Description

Using x509.certificate_managed to generate certificates works for a few hosts at a time, but not en-mass (and I only have 75 minions, so it's not exactly big scale). If I do a salt run on '*' or on larger groups of minions, salt-master stops responding to minions.

If I remove the x509.certificate_managed state, it all goes back to normal and works fine. This happens even if the minions already have a valid certificate so it's not generating one, but just checking for expiry.

Setup

It's pretty much the setup documented at: https://docs.saltstack.com/en/latest/ref/states/all/salt.states.x509.html

/etc/pki/{{ grains['fqdn'] }}.key:
  x509.private_key_managed:
    - bits: 4096
    - backup: True
    - require:
      - x509: /etc/pki/ca.crt

{% set ips = [] %}

{% for ip in grains['ipv4'] %}
  {% set ips = ips.append('IP:' + ip) %}
{% endfor %}

{% for ip in grains['ipv6'] %}
  {% if ip != '::1' %}
    {% set ips = ips.append('IP:' + ip) %}
  {% endif %}
{% endfor %}

/etc/pki/{{ grains['fqdn'] }}.crt:
  x509.certificate_managed:
    - ca_server: salt-master.mydomain.net
    - signing_policy: mypolicy
    - public_key: /etc/pki/{{ grains['fqdn'] }}.key
    - CN: {{ grains['fqdn'] }}
    - O: MyCorp
    - Email: [email protected]
    - C: GB
    - L: London
    - subjectAltName: '{{ ips|join(',') }}'
    - days_valid: 90
    - days_remaining: 30
    - backup: True
    - require:
      - x509: /etc/pki/{{ grains['fqdn'] }}.key

x509.conf looks like:

x509_signing_policies:
  mypolicy:
    - minions: '*'
    - signing_private_key: /etc/pki/ca.key
    - signing_cert: /etc/pki/ca.crt
    - O: MyCorp
    - Email: [email protected]
    - C: GB
    - L: London
    - basicConstraints: "critical CA:false"
    - keyUsage: "critical keyEncipherment"
    - subjectKeyIdentifier: hash
    - authorityKeyIdentifier: keyid,issuer:always
    - days_valid: 90
    - copypath: /etc/pki/issued_certs/

and /etc/salt/master.d/peer.conf:

peer:
  .*:
    # Sign a certificate with another minion that has a signing certificate.
    # The requesting minion must also be granted access in a signing
    # policy on the minion with the certificate.
    - x509.sign_remote_certificate

Steps to Reproduce the behavior

My salt runs across all minions consistently take ~39s:

root@salt-master:~# time salt --state-output=changes --state_verbose=false -G 'os:Debian' state.highstate
[snip]
real	0m39.309s
user	0m2.165s
sys	0m0.194s

If I add in the /etc/pki/{{ grains['fqdn'] }}.crt state above, it all stops working:

root@salt-master:~# time salt --state-output=changes --state_verbose=false \* state.highstate
[ERROR   ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

real	3m32.809s
user	0m4.533s
sys	0m0.322s

If I remove that one state, it all returns to normal and works fine again.

Expected behavior

Checking a certificate's validity/expiry should not add significant time/load to a salt run.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
root@salt-master:~# salt --versions-report
Salt Version:
          Salt: 3002.2

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.7.3
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10
       libgit2: Not Installed
      M2Crypto: 0.31.0
          Mako: Not Installed
       msgpack: 0.5.6
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: Not Installed
        Python: 3.7.3 (default, Jul 25 2020, 13:03:44)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 17.1.2
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.1

System Versions:
          dist: debian 10 buster
        locale: UTF-8
       machine: x86_64
       release: 4.19.0-12-amd64
        system: Linux
       version: Debian GNU/Linux 10 buster

Additional context
Add any other context about the problem here.

@ichilton ichilton added the Bug broken, incorrect, or confusing behavior label Dec 18, 2020
@welcome
Copy link

welcome bot commented Dec 18, 2020

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at [email protected]. We’re glad you’ve joined our community and look forward to doing awesome things with you!

@OrangeDog
Copy link
Contributor

See also #52167

@sagetherage sagetherage added this to the Silicon milestone Feb 11, 2021
@sagetherage sagetherage modified the milestones: Silicon, Approved Aug 9, 2021
@Ch3LL Ch3LL added Sulfur v3006.0 release code name and version and removed Silicon v3004.0 Release code name labels Oct 25, 2021
@Ch3LL Ch3LL modified the milestones: Approved, Sulphur v3006.0 Oct 25, 2021
@jcjones
Copy link

jcjones commented Aug 22, 2022

The basic issue with this is: The Salt CA is asked to sign new certificates, for every managed certificate, on every highstate.

This is aggravated by using a 4096 bit RSA key, which needs substantially more compute resources to use for signing than a 2048 bit, or using ECDSA.

Following the logic:
During a call of state.certificate_managed, whether or not test=true, we call _certificate_is_valid:

salt/salt/states/x509.py

Lines 659 to 661 in 064729c

is_valid, invalid_reason, current_cert_info = _certificate_is_valid(
name, days_remaining, append_certs, **kwargs
)

Similarly, among the first things _certificate_is_valid does is run __salt__["x509.create_certificate"] in test mode, to get required_cert_info to determine whether the certificate profile has changed.

salt/salt/states/x509.py

Lines 503 to 505 in 064729c

required_cert_info = __salt__["x509.create_certificate"](
testrun=True, **cert_spec
)

The module x509.create_certificate, when run on a minion, asks the CA server to sign_remote_certificate , asking the certificate be re-issued:

salt/salt/modules/x509.py

Lines 1447 to 1452 in d5dec79

certs = __salt__["publish.publish"](
tgt=ca_server,
fun="x509.sign_remote_certificate",
arg=salt.utils.data.decode_dict(kwargs, to_str=True),
timeout=30,
)

sign_remote_certificate re-enters the module x509.create_certificate but now in the context of the CA server:

salt/salt/modules/x509.py

Lines 1093 to 1096 in d5dec79

try:
return create_certificate(path=None, text=True, **argdic)
except Exception as except_: # pylint: disable=broad-except
return str(except_)

Now, our kwargs might well have testrun set, but we go all the way through minting a new, trusted certificate before evaluating whether we're in a test run:

salt/salt/modules/x509.py

Lines 1650 to 1669 in d5dec79

cert.sign(
_get_private_key_obj(
kwargs["signing_private_key"],
passphrase=kwargs["signing_private_key_passphrase"],
),
kwargs["algorithm"],
)
if not verify_signature(cert, signing_pub_key=signing_cert):
raise salt.exceptions.SaltInvocationError(
"failed to verify certificate signature"
)
if "testrun" in kwargs and kwargs["testrun"] is True:
cert_props = read_certificate(cert)
cert_props["Issuer Public Key"] = get_public_key(
kwargs["signing_private_key"],
passphrase=kwargs["signing_private_key_passphrase"],
)
return cert_props

Then we unwind.

The basic issue here is that we're always issuing a new certificate by the CA on every highstate, and if they all hit the CA server at once, it gets overloaded. As I said above, that's made worse by using a more secure keysize like 4096.

The key bug here is that the state.certificate_managed actually wants to know whether the certificate profile changed. Right now, it does that by getting a second valid, trusted cert, and looking for meaningful deviations from the certificate on disk.

The solution is for _certificate_is_valid to instead request the certificate profile metadata from the CA server and compare that profile to the certificate on disk.

@OrangeDog OrangeDog added this to the Sulphur v3006.0 milestone Dec 22, 2022
@johncollaros
Copy link

I have experienced this issue myself, and after a log of digging around discovered the --batch-size parameter which can be passed to the salt command, or -batch: xx% in orchestrate .sls files.

This should be recommended in the documentation for x509 module. It solved my issue "ca_server did not respond. Salt master must permit peers to call the sign_remote_certificate function." when x509.certificate_managed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants