
wonky loader syndrome #23839

Closed
gladiatr72 opened this issue May 18, 2015 · 9 comments

@gladiatr72 (Contributor)

high level:

12-15 cloud instance minions
custom execution module: tls2 (virtual: tls_ca)
custom runner module: tls_ca_run (virtual: tls_ca)

salt-cloud -dy [list of 6+ minions]

The first 4-5 are destroyed, the event is caught, the reactor is triggered, the orchestration states are rendered and processed successfully.

Then... this.

versions

           Salt: 2015.5.0-401-g3df3b85
         Python: 2.7.5 (default, Jun 17 2014, 18:11:42)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.4.6
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: 1.4.2
         PyYAML: 3.10
          ioflo: 1.2.1
          PyZMQ: 14.3.1
           RAET: 0.6.3
            ZMQ: 3.2.5
           Mako: Not Installed

In the ext pillar calls, _s is just an alias for __salt__

tls_ca is a custom execution module
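
Presumably _s is bound near the top of the pillar module's entry point, along these lines (an illustration, not the actual tls_gen.py; the signature matches how salt.pillar calls ext_pillar in the traceback below):

def ext_pillar(minion_id, pillar, **kwargs):
    _s = __salt__    # __salt__ is injected into pillar modules by the loader
    ca_opts = _s['tls_ca.config']()    # the call that blows up below
    ...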

2015-05-17 23:11:10,787 [salt.pillar      ][ERROR   ][16125] Failed to load ext_pillar tls_gen: Set changed size during iteration
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 534, in ext_pillar
    key)
  File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 493, in _external_pillar_data
    ext = self.ext_pillars[key](self.opts['id'], pillar, **val)
  File "/srv/ext_modules/pillar/tls_gen.py", line 135, in ext_pillar
    ca_opts = _s['tls_ca.config']()
  File "/usr/lib/python2.7/site-packages/salt/utils/lazy.py", line 85, in __getitem__
    if self._load(key):
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1102, in _load
    ret = _inner_load(mod_name)
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1091, in _inner_load
    if self._load_module(name) and key in self._dict:
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1068, in _load_module
    Depends.enforce_dependencies(self._dict, self.tag)
  File "/usr/lib/python2.7/site-packages/salt/utils/decorators/__init__.py", line 81, in enforce_dependencies
    for module, func, fallback_function in dependent_set:
RuntimeError: Set changed size during iteration
2015-05-17 23:11:10,790 [salt.state       ][INFO    ][16125] Loading fresh modules for state activity
2015-05-17 23:11:10,788 [salt.pillar      ][ERROR   ][16125] Failed to load ext_pillar tls_gen: Set changed size during iteration
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 534, in ext_pillar
    key)
  File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 493, in _external_pillar_data
    ext = self.ext_pillars[key](self.opts['id'], pillar, **val)
  File "/srv/ext_modules/pillar/tls_gen.py", line 135, in ext_pillar
    ca_opts = _s['tls_ca.config']()
  File "/usr/lib/python2.7/site-packages/salt/utils/lazy.py", line 85, in __getitem__
    if self._load(key):
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1102, in _load
    ret = _inner_load(mod_name)
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1091, in _inner_load
    if self._load_module(name) and key in self._dict:
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1068, in _load_module
    Depends.enforce_dependencies(self._dict, self.tag)
  File "/usr/lib/python2.7/site-packages/salt/utils/decorators/__init__.py", line 81, in enforce_dependencies
    for module, func, fallback_function in dependent_set:
RuntimeError: Set changed size during iteration


2015-05-17 23:11:05,622 [salt.state       ][INFO    ][16125] Exception occurred in runner tls_ca.revoke_certificate: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/client/mixins.py", line 317, in low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/srv/ext_modules/runners/tls_ca_run.py", line 225, in revoke_certificate
    ca_list = fn['tls_ca.config']().get('authorities').keys()
  File "/usr/lib/python2.7/site-packages/salt/utils/lazy.py", line 85, in __getitem__
    if self._load(key):
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1102, in _load
    ret = _inner_load(mod_name)
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1091, in _inner_load
    if self._load_module(name) and key in self._dict:
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1068, in _load_module
    Depends.enforce_dependencies(self._dict, self.tag)
  File "/usr/lib/python2.7/site-packages/salt/utils/decorators/__init__.py", line 81, in enforce_dependencies
    for module, func, fallback_function in dependent_set:
RuntimeError: Set changed size during iteration
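
For context, this RuntimeError is exactly what CPython raises when one thread resizes a set that another thread is iterating. A minimal sketch, independent of Salt, in which the shared set stands in for the Depends dependency data:

import threading

shared = set(range(10000))

def mutator():
    for i in range(10000, 20000):
        shared.add(i)   # resizes the set while the main thread iterates

t = threading.Thread(target=mutator)
t.start()
try:
    for item in shared:   # RuntimeError: Set changed size during iteration
        pass              # (timing dependent; may take a few runs to hit)
except RuntimeError as exc:
    print(exc)
t.join()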

The reactor

syncit:
  runner.state.orchestrate:
    - mods: orc.decommision
    - pillar:
        id: {{data.name}}

The orchestration state

{% set target_expr = pillar.id %}
{% set mminion = salt.extutils.master_id() -%}

decom_state_end:
  salt.function:
    - name: event.fire_master
    - tgt: {{mminion}}
    - expr_form: compound
    - kwarg:
        data:
          msg: decommission complete
        tag: orc/decommission/{{target_expr}}/complete

mine_clear:
  salt.runner:
    - name: cache.clear_mine
    - tgt: {{target_expr}}
    - expr_form: compound
    - require_in:
      - salt: decom_state_end

revoke_cert:
  salt.runner:
    - name: tls_ca.revoke_certificate
    - tgt: {{target_expr}}
    - require_in:
      - salt: decom_state_end

The runner

import sys
import logging

import salt.loader

log = logging.getLogger(__name__)


def __init__(opts):
    # the loader calls __init__ with the master opts; stash the minion
    # execution modules on this module so the functions below can reach
    # them as fn[...]
    fn = salt.loader.minion_mods(opts)
    setattr(sys.modules[__name__], 'fn', fn)

[...]
def revoke_certificate(tgt=None):
    '''
    takes one argument: target (minion_id)

    This function revokes the target's certificates on all configured
    CAs.  Meant to be called from a salt/cloud/*/destroyed reactor.
    '''

    retval = {}
    if not tgt:
        return retval

    event = {'changes': {}}

    ca_list = fn['tls_ca.config']().get('authorities').keys()

    for ca in ca_list:
        fire_event = False
        for ctype in ['server', 'client']:
            if fn['tls_ca.get_certificate'](ca, CN=tgt, cert_type=ctype):
                fn['tls_ca.revoke_cert'](ca, CN=tgt, cert_type=ctype)
                res = 'deleted'
                fire_event = True
            else:
                res = 'not present'

            retval.setdefault('changes', {}).setdefault(
                ca, {}).setdefault(
                tgt, {}).update({ctype: res})

            if fire_event:
                event.setdefault('changes', {}).setdefault(
                    ca, {}).setdefault(
                    tgt, {}).update({ctype: res})

    _displayit(retval, tgt)
    log.debug(event)
    _flingit(event)

    return retval
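
For testing outside the reactor, the runner can also be invoked by hand (the minion id here is hypothetical):

salt-run tls_ca.revoke_certificate tgt=web01.example.com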

Both the runner and the external pillar function perfectly except when I give salt-cloud more than a handful of instances to destroy at a time. Release 2015.5.0 behaves identically on this sort of run, and 2014.7.5 also exhibits loader failures under the same conditions--the only difference is that it merely complains that the external module functions do not exist rather than raising the set-size error.

I am at a loss. If it would help, I'll gzip a debug log from a failed run and toss it up somewhere.

I looked at the possibility of just queueing the minion_ids of the destroyed instances, but it looks like the only interface for automatically processing them (the queue runner) just wants to dump events back onto the bus--I suspect it would run into the same difficulty.

I thought this was such a simple state that it could easily be handled straight from the reactor; when I saw the same sorts of problems there, I figured it needed to be handed off to orchestration.

I started developing this process on 2014.7.4 and moved to 2015.5.0 when its release was announced. The log fragments above are from 2015.5 (about 3 hours ago). I'll probably go back to 2014.7.5; either way the result is the same.

It is an important part of the process I've been building that these certificates be revoked and removed very quickly after instance destruction. They contain multiple IP addresses in their subjAltName field--a minion with the same name holding an old certificate would just cause all sorts of confusion.

The next step is (zomg) cron-triggered shell scripts. It kinda makes me want to cry when I think about it. Any assistance that can be provided will be quite welcome.

@gladiatr72 (Contributor, Author)

I threw the external pillar exception in there just because... well, I only just noticed it was having the problem--it has been running without failure for several weeks--and thought I may as well include it in case it provides some additional insight.

@gladiatr72 (Contributor, Author)

As a follow-up: after observing that the loader did not seem to keep track of external modules of various types consistently, I had the idea of linking them into the installed salt tree to see if the problem cleared up.

Aaand it does. Not ideal, but I can deal with it for my current client by writing a state collection for the master's minion that drops the modules in place and restarts the salt-master.

I haven't done a deep dive into the threading/process bits of salt, but it almost seems as if, when the reactor thread is under low to moderate load and the master is asked to spawn multiple loader instances in parallel (such as when spawning multiple runners to handle the aforementioned mass cloud instance destruction), the threads spawned to initialize the loader object return before external module loading is complete. That would fit my initial observation of the first 3-4 orchestration jobs executing correctly, followed by a cascade of failed jobs and loader exceptions about missing module.funcs.

Putting the modules in place to be picked up as internal modules did help somewhat. It changed the statistics: instead of only the first four orchestration runs succeeding, now all but the final four succeed.

@gladiatr72 (Contributor, Author)

I've been staring at the loader code and the Depends decorator. I understand the goal, but without properly synchronized updates to the dependency_dict it produces the above error. The commit referenced below doesn't attempt to fix the actual problem, but it does make it so that modules.(archive|oracle) and test.missing_func() can't hamstring the loader.
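
To illustrate the kind of synchronization I mean, a rough sketch (the Depends/dependency_dict names come from salt/utils/decorators/__init__.py, but the lock and the snapshot-under-lock are hypothetical, not Salt's actual code or the commit below):

import threading

class Depends(object):
    # class-global: {dependency: set of (module, func, fallback_function)}
    dependency_dict = {}
    _dep_lock = threading.RLock()    # hypothetical guard; not in Salt

    @classmethod
    def enforce_dependencies(cls, functions, kind):
        # snapshot the shared sets while holding the lock, so a loader on
        # another thread registering a dependency cannot resize them mid-copy
        with cls._dep_lock:
            snapshot = dict((dep, set(funcs))
                            for dep, funcs in cls.dependency_dict.items())
        for dependency, dependent_set in snapshot.items():
            for module, func, fallback_function in dependent_set:
                pass    # existing enforcement logic would run here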

rallytime pushed a commit to rallytime/salt that referenced this issue May 20, 2015
Depends.enforce_dependency() class method fires unsuccessfully. There appears
to be no synchronization within the Depends decorator class wrt the
class-global dependency_dict, which results in incomplete population of any
loader instantiation occurring at the time of one of these exceptions.

This would mitigate the immediate effects of saltstack#23839 and saltstack#23373.
@jfindlay jfindlay added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around Core relates to code central or existential to Salt P3 Priority 3 labels May 20, 2015
@jfindlay jfindlay added this to the Approved milestone May 20, 2015
@jacksontj (Contributor)

Interesting failure... I was looking through the code, and that set definitely isn't mutated during the loop, so I'm very interested in how it's changing size during the iteration--especially since the code specifically makes a copy of the set before iterating over it. Do any of your modules spawn background threads or anything?

If not, is it possible to make a simpler reproduction case?

@gladiatr72 (Contributor, Author)

No thread/proc spawning other than what happens normally during reactor job-processing. I haven't dissected the thread pool bits, though--is there a trigger point where the reactor's thread pool expands, or are all of its threads pre-baked during master initialization?

An additional condition I observed: the loader's failure threshold increased when I moved most of the orchestration state and function calls directly into the reactor state rather than calling them via the orchestration runner. It still happens; however, the number of non-error runner calls is much higher before the failure state occurs.

@jacksontj (Contributor)

They are all pre-baked, but it's possible that they are all acting on the same memory space (causing your race). Can you try setting the number of reactor threads (reactor_worker_threads) to 1? If that clears it up (which I expect it will), we'll have to come up with a way of making the loader threadsafe--since it seems that it isn't :)
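
For reference, that test is one master config line (reactor_worker_threads defaults to 10) plus a salt-master restart:

# /etc/salt/master
reactor_worker_threads: 1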

@gladiatr72 (Contributor, Author)

I will give that a shot over the next couple days and get back to you with the results!


@gladiatr72 (Contributor, Author)

Just wanted to let you know that this is still on my radar--I'll hopefully have some time to get to this within the week.

@cachedout (Contributor)

I am wondering if this got addressed with the introduction of ContextDict.

@cachedout cachedout added the fixed-pls-verify fix is linked, bug author to confirm fix label May 19, 2016