[BUG] Cannot clean up corrupted cache when pillar_cache_backend=disk #62527

Closed
rgeoghegan opened this issue Aug 24, 2022 · 7 comments
Labels: Bug (broken, incorrect, or confusing behavior), Pillar, severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around)

Comments

rgeoghegan commented Aug 24, 2022

Description
When the pillar_cache_backend: "disk" config option is in use and the on-disk msgpack cache file for a minion gets corrupted, that minion's pillar comes back empty, and every attempt to run the pillar.clear_pillar_cache runner crashes, even after the salt-master is restarted.

Setup

I am using Salt 3004.2 from the yum repo:

[root@saltmaster /]# yum info salt-master
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: mirror.its.dal.ca
 * extras: centos.les.net
 * updates: mirror.its.dal.ca
Installed Packages
Name        : salt-master
Arch        : noarch
Version     : 3004.2
Release     : 1.el7
Size        : 3.2 M
Repo        : installed
From repo   : salt-3004-repo
Summary     : Management component for salt, a parallel remote execution system
URL         : http://saltstack.org/
License     : ASL 2.0
Description : The Salt master is the central server to which all minions connect.
            : Supports Python 3.

I set up a system with a master, a minion, and one pillar file:

my_pillar.sls

my_pillar:
  salt_rules: "rules"

Setup type: container. To show the bug, I am running a master and a minion in docker-compose using centos7 docker images.

Steps to Reproduce the behavior

I start with my pillar working as expected:

[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
    my_pillar:
        ----------
        salt_rules:
            rules

I append some bytes to the minion's pillar cache file so that it is no longer valid msgpack:

[root@saltmaster /]# echo fff >> /var/cache/salt/master/pillar_cache/saltminion

Now my pillar is reported as empty:

[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------

And the master log has an exception:

[INFO    ] 17:13:09 User root Published command pillar.items with jid 20220824171309872817
[ERROR   ] 17:13:10 Error in function _pillar:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/master.py", line 1917, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python3.6/site-packages/salt/master.py", line 1611, in _pillar
    extra_minion_data=load.get("extra_minion_data"),
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 81, in get_pillar
    pillarenv=pillarenv,
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 408, in __init__
    minion_cache_path=self._minion_cache_path(minion_id),
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 34, in factory
    return CacheDisk(ttl, kwargs["minion_cache_path"], *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 89, in __init__
    self._read()
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 147, in _read
    salt.utils.msgpack.load(fp_, encoding=__salt_system_encoding__)
  File "/usr/lib/python3.6/site-packages/salt/utils/msgpack.py", line 145, in unpack
    return msgpack.unpack(stream, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "/usr/lib64/python3.6/site-packages/msgpack/__init__.py", line 57, in unpack
    return unpackb(data, **kwargs)
  File "msgpack/_unpacker.pyx", line 209, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
[INFO    ] 17:13:10 Got return from saltminion for job 20220824171309872817
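
The last frames of the traceback show salt.utils.msgpack.load ending up in msgpack's one-shot unpackb, which expects the data to hold exactly one serialized object and raises ExtraData when bytes are left over after it. That is what the appended fff line produces. The sketch below reproduces the same failure mode with the standard library's json module standing in for msgpack (an assumption made only so it runs without third-party packages); one_shot_decode and first_object are illustrative helpers, not Salt or msgpack functions. The one-shot decode rejects the trailing bytes, while decoding just the first object shows the original pillar data is still intact.

```python
import json

# Stand-in for the cached pillar. json replaces msgpack here only so the
# example runs with the standard library.
cached = json.dumps({"my_pillar": {"salt_rules": "rules"}}).encode("utf-8")

# Equivalent of `echo fff >> <cache file>`: trailing bytes after a valid object.
corrupted = cached + b"fff\n"


def one_shot_decode(data: bytes):
    """Illustrative helper: data must contain exactly one object (like unpackb)."""
    return json.loads(data)


def first_object(data: bytes):
    """Illustrative helper: decode only the first object, report where it ended."""
    return json.JSONDecoder().raw_decode(data.decode("utf-8"))


print(one_shot_decode(cached))

try:
    one_shot_decode(corrupted)
except json.JSONDecodeError as exc:
    # json reports the same class of problem as msgpack's ExtraData.
    print("rejected:", exc.msg)

obj, end = first_object(corrupted)
print(obj, "trailing bytes:", corrupted[end:])
```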

Running the pillar.clear_pillar_cache runner does not help; it fails with the same exception, yet still exits with status 0:

[root@saltmaster /]# salt-run pillar.clear_pillar_cache
Exception occurred in runner pillar.clear_pillar_cache: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/client/mixins.py", line 390, in low
    data["return"] = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/contextvars/__init__.py", line 38, in run
    return callable(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/runners/pillar.py", line 140, in clear_pillar_cache
    __opts__, grains, id_, saltenv, pillarenv=pillarenv
  File "/usr/lib/python3.6/site-packages/salt/pillar/__init__.py", line 408, in __init__
    minion_cache_path=self._minion_cache_path(minion_id),
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 34, in factory
    return CacheDisk(ttl, kwargs["minion_cache_path"], *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 89, in __init__
    self._read()
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 147, in _read
    salt.utils.msgpack.load(fp_, encoding=__salt_system_encoding__)
  File "/usr/lib/python3.6/site-packages/salt/utils/msgpack.py", line 145, in unpack
    return msgpack.unpack(stream, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "/usr/lib64/python3.6/site-packages/msgpack/__init__.py", line 57, in unpack
    return unpackb(data, **kwargs)
  File "msgpack/_unpacker.pyx", line 209, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
[root@saltmaster /]# echo $?
0
[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
[root@saltmaster /]#

And all this behaviour persists even if the salt-master is restarted.

If I delete the cache file, everything returns to normal:

[root@saltmaster /]# rm -f /var/cache/salt/master/pillar_cache/saltminion
[root@saltmaster /]# salt \* pillar.items
saltminion:
    ----------
    my_pillar:
        ----------
        salt_rules:
            rules
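
Until the bug is fixed, the only recovery is the manual rm shown above. A sketch of a helper that finds every unreadable file in a pillar cache directory, and optionally deletes it, could look like the following. find_unreadable_cache_files is a hypothetical name, not part of Salt. It takes the deserializer and the set of "this file is corrupt" exceptions as parameters so it can be exercised with json; on a master you would pass a msgpack decoder instead. By default it only reports files, so the list can be reviewed before anything is deleted.

```python
import os
from typing import Any, Callable, List, Tuple, Type


def find_unreadable_cache_files(
    cache_dir: str,
    deserialize: Callable[[bytes], Any],
    corrupt_errors: Tuple[Type[BaseException], ...] = (ValueError,),
    remove: bool = False,
) -> List[str]:
    """Hypothetical helper: return cache files in cache_dir that fail to decode.

    With remove=True the unreadable files are also deleted, which automates
    the manual workaround from the issue (rm -f the minion's cache file).
    """
    unreadable = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        with open(path, "rb") as fp:
            data = fp.read()
        try:
            deserialize(data)
        except corrupt_errors:
            # Only decode failures count as corruption; I/O errors propagate.
            unreadable.append(path)
            if remove:
                os.remove(path)
    return unreadable
```

For example, calling it with "/var/cache/salt/master/pillar_cache", msgpack.unpackb and (ValueError, msgpack.exceptions.UnpackException) should list the saltminion file corrupted above, since ExtraData is a ValueError subclass. Depending on the installed msgpack version, unpackb may need extra keyword arguments to accept everything Salt writes, so treat a non-empty result as something to review before passing remove=True.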

Expected behavior

IMHO, an unreadable cache file should be treated the same as a missing one, so that the pillar is simply rebuilt.

Versions Report

[root@saltmaster /]# salt --versions-report
Salt Version:
          Salt: 3004.2

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 17.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 5.10.104-linuxkit
        system: Linux
       version: CentOS Linux 7 Core

rgeoghegan added the Bug and needs-triage labels Aug 24, 2022

welcome bot commented Aug 24, 2022

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at [email protected]. We’re glad you’ve joined our community and look forward to doing awesome things with you!

rgeoghegan (Author) commented:

@Ch3LL Hi! I was on the salt community call last week, and I promised to file the bug I was trying to describe.

I could also submit a patch that wraps the msgpack read in a try/except and treats any relevant exception (msgpack decode errors, file-not-found, and so on) the same as a missing file, so that the cache is rebuilt and saved correctly.
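
A minimal sketch of the approach described above, assuming a cache that stores one serialized object per file. This is not the code from the PR that closed this issue; load_or_rebuild and its parameters are hypothetical names, and in Salt the read itself happens in CacheDisk._read in salt/utils/cache.py (per the traceback). A missing file and a file that fails to decode both fall through to the same rebuild-and-save path. Only the listed decode errors count as a cache miss, so unrelated problems such as a PermissionError still surface.

```python
from typing import Any, Callable, Tuple, Type


def load_or_rebuild(
    path: str,
    deserialize: Callable[[bytes], Any],
    serialize: Callable[[Any], bytes],
    rebuild: Callable[[], Any],
    corrupt_errors: Tuple[Type[BaseException], ...] = (ValueError,),
) -> Any:
    """Hypothetical helper: return cached data, rebuilding if missing or unreadable."""
    try:
        with open(path, "rb") as fp:
            return deserialize(fp.read())
    except FileNotFoundError:
        pass  # no cache yet: build it below
    except corrupt_errors:
        pass  # corrupted cache: treat it exactly like a missing file

    data = rebuild()
    # Overwrite (or create) the cache file so the next read succeeds.
    with open(path, "wb") as fp:
        fp.write(serialize(data))
    return data
```

With msgpack, serialize and deserialize could be msgpack.packb and msgpack.unpackb, and corrupt_errors could include msgpack.exceptions.UnpackException alongside ValueError; ExtraData, the error in this report, is a ValueError subclass.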

Ch3LL (Contributor) commented Aug 29, 2022

Looks like I'm able to replicate this. If you submit a PR, I will be more than willing to review and test it. I haven't gone into the code yet, but I will when you submit the PR and make sure it's the correct fix.

Ch3LL added this to the Sulphur v3006.0 milestone Aug 29, 2022
Ch3LL self-assigned this Aug 29, 2022
Ch3LL added the Pillar and severity-medium labels and removed the needs-triage label Aug 29, 2022
rgeoghegan mentioned this issue Sep 27, 2022
rgeoghegan (Author) commented:

FYI it took a while because I was out on vacation, but I just put up the PR.

dwoz (Contributor) commented Sep 28, 2022

@rgeoghegan Do we know the reason this file is getting corrupted in the first place?

rgeoghegan (Author) commented:

@dwoz Nothing specific is corrupting the file. I was experimenting with clearing the pillar cache by simply deleting the cache file, and in the process I noticed a race condition in the code as well as this bug: once the file is corrupted, there is no way to recover other than deleting the disk cache file manually.
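
The comment does not describe the race in detail. If part of it is a reader opening the cache file while another process is still writing it, one common mitigation (an assumption here, not something this issue or Salt is confirmed to do) is to write the new contents to a temporary file in the same directory and then move it into place with os.replace, which the Python documentation describes as atomic on POSIX when both paths are on the same filesystem. Readers then see either the old file or the complete new one. atomic_write_bytes is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Hypothetical helper: write data so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
            fp.flush()
            os.fsync(fp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename over the old cache file
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A real patch would also need any code that lists the cache directory to ignore these temporary files.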

Ch3LL (Contributor) commented Aug 2, 2023

Closed by #62760

Ch3LL closed this as completed Aug 2, 2023