reactor/orchestrate race condition on salt['pillar.get'] #23373
@tnypex, thanks for the report.
My first thought here is that we might be hitting an issue with threading in the reactor, though it doesn't quite seem clear how that might be affecting things. We'll need to carefully step through the code here to see what's happening.
@tnypex There should be some lines in your debug log that start with
Here they are:
I don't really know what I am doing :), but it looks like the following fixes this issue.
Depends.enforce_dependency() class method fires unsuccessfully. There appears to be no synchronization within the Depends decorator class with respect to the class-global dependency_dict, which results in incomplete population of any loader instantiation occurring at the time of one of these exceptions. This would mitigate the immediate effects of saltstack#23839 and saltstack#23373.
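The synchronization being described would look roughly like the sketch below; the names mirror the real decorator, but this is a hypothetical illustration of the locking idea, not the actual Salt code.

```python
import threading
from collections import defaultdict


class Depends(object):
    # Hypothetical sketch: a lock guarding the class-global registry.
    dependency_dict = defaultdict(set)
    _lock = threading.RLock()

    def __init__(self, *dependencies):
        self.dependencies = dependencies

    def __call__(self, function):
        # Register under the lock so a loader instantiated concurrently
        # never sees a half-populated registry.
        with self._lock:
            for dep in self.dependencies:
                self.dependency_dict[dep].add(function)
        return function

    @classmethod
    def enforce_dependencies(cls, functions):
        # Snapshot under the lock before iterating, so a registration in
        # another thread cannot raise "Set changed size during iteration".
        with cls._lock:
            snapshot = {dep: set(funcs) for dep, funcs in cls.dependency_dict.items()}
        for dep, funcs in snapshot.items():
            if dep not in functions:
                for func in funcs:
                    functions.pop(func.__name__, None)
        return functions
```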
@jacksontj We've tracked this down to a race condition due to your addition of threading in the reactor system. @DmitryKuzmenko's evaluation is below:
@jacksontj Do you have any ideas on how to make this work with pillar? We may have to revert your threading work there, or at least config-gate it.
We're expecting the best workaround will actually be to implement a multiprocessing option alongside the threading implementation, and have a config value to choose between them (defaulting to multiprocessing).
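For illustration, the config-gated choice between a thread pool and a process pool could look roughly like this. The option names here (e.g. reactor_worker_pool) are hypothetical; this is not how Salt's reactor is actually wired up.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_reaction(low):
    # Stand-in for rendering a reactor SLS and kicking off the runner.
    return low


def get_reactor_pool(opts):
    if opts.get('reactor_worker_pool', 'multiprocessing') == 'multiprocessing':
        # Separate processes: no shared globals, so no pillar cross-talk,
        # and work can spread across cores.
        return ProcessPoolExecutor(max_workers=opts.get('reactor_worker_threads', 10))
    # Threads: lighter weight, but injected globals are shared, which is
    # exactly the race being discussed in this issue.
    return ThreadPoolExecutor(max_workers=opts.get('reactor_worker_threads', 10))


if __name__ == '__main__':
    with get_reactor_pool({'reactor_worker_pool': 'multiprocessing'}) as pool:
        print(list(pool.map(run_reaction, [{'fun': 'state.orchestrate'}] * 3)))
```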
Hmm, that would work, but we'll probably need another way around this as well. With the move to tornado, the plan is to make more things happen concurrently (using coroutines). This means that more things will be running in parallel within the same process, not to mention the fact that we support threads (multiprocessing = False). So we'll need to figure this one out, although this is a fun one ;)

We've actually run into this problem in a few different variants recently. The basic issue is that we are injecting a bunch of metadata specific to that run into the globals() space, which means the next caller (or, in this case, a concurrent caller) gets access to those as well. Ideally we'd have some sort of per-request storage space (similar to flask's request context). From a little looking around it seems that tornado already has this functionality more or less built in. I found a gist which shows something we can mess with, and I will probably have some time to play around with this today. It seems to all work fine, although it looks like it might be an API change (though I have some ideas for some magic to make it compatible); we'll see.

That being said, making a processpool option should be very simple. Note that it will have different performance characteristics (today it can use 1 core; in a processpool it can use up to N).
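A toy demonstration of the shared-globals problem being described (not Salt code): every run reads the same module-level storage, so a concurrent run can see another run's pillar.

```python
import threading
import time

FAKE_MODULE_GLOBALS = {}  # stands in for the globals() of a loaded module


def inject_and_run(jid, pillar, results):
    FAKE_MODULE_GLOBALS['__pillar__'] = pillar  # loader-style injection
    time.sleep(0.01)                            # let another thread clobber it
    results[jid] = FAKE_MODULE_GLOBALS['__pillar__']  # may be another run's data


results = {}
threads = [
    threading.Thread(target=inject_and_run,
                     args=(jid, {'target_server': 'srv%d' % jid}, results))
    for jid in (1, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # frequently shows every run ending up with the same pillar
```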
I know @DmitryKuzmenko has a fix for this that he's testing, but I don't know the details yet...
I've actually been hacking most of the day on a RequestContext thing. It won't be complete today, but I should be able to push up some code to get some feedback. |
@basepi I have a branch where I'm messing around with this (https://github.com/jacksontj/salt/tree/concurrent) and it seems to be working fine. I still have to add tests and make it work in the reactor, but it's making progress :)
My approach is close to @jacksontj's, but I had heard nothing about Tornado's contexts and implemented my own DictWrapper that maps the thread name to the actual dict data. My PR also contains two bugfixes. One is for salt-master, which doesn't always exit properly on a break signal. The second one is for the highstate outputter; the issue is reproduced with the described case. The outputter expects highstate data to contain only a key-to-dict mapping, but here it gets something else.
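A minimal sketch of the thread-name-keyed wrapper idea (illustrative only; the PR's actual DictWrapper may differ in detail):

```python
import threading


class DictWrapper(object):
    """Acts like a dict, but each thread sees its own underlying data."""

    def __init__(self):
        self._per_thread = {}

    def _data(self):
        return self._per_thread.setdefault(threading.current_thread().name, {})

    def __getitem__(self, key):
        return self._data()[key]

    def __setitem__(self, key, value):
        self._data()[key] = value

    def get(self, key, default=None):
        return self._data().get(key, default)


# Usage: a value set here is only visible to the thread that set it.
wrapped = DictWrapper()
wrapped['pillar'] = {'target_server': 'srv1'}
print(wrapped.get('pillar'))
```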
It's also a good question: if one of these approaches fixes the threading behavior, do we still need a multiprocessing reactor implementation?
Inspired by saltstack#23373. The basic issue we ran into is that the loader is injecting globals directly into the global namespace. This means that these injected values are not thread- or coroutine-safe, so we can never do more than one thing at a time. Instead of multiprocessing everything to death, we can simply use a stack_context to handle this nicely down in the core. As far as module authors/users are concerned nothing has changed, but the storage behind the scenes is now per-JID. This same set of classes can easily be used to store additional data (next candidates are reactors, master MWorker tasks, etc.).
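Conceptually, the per-JID storage described here behaves something like the following sketch (a hypothetical API, not the implementation that was merged): module code keeps reading the same object, but each job scope sees its own data.

```python
import contextlib
import threading


class ContextDict(object):
    """Hypothetical per-run storage: reads inside a scope see that scope's data."""

    def __init__(self):
        self._local = threading.local()
        self._global = {}

    def _current(self):
        return getattr(self._local, 'data', self._global)

    def __getitem__(self, key):
        return self._current()[key]

    def get(self, key, default=None):
        return self._current().get(key, default)

    @contextlib.contextmanager
    def scoped(self, **overrides):
        previous = getattr(self._local, 'data', None)
        self._local.data = dict(self._global, **overrides)
        try:
            yield self
        finally:
            if previous is None:
                del self._local.data
            else:
                self._local.data = previous


# Everything inside the "with" block sees that job's pillar; other threads
# (or the same thread outside the block) do not.
ctx = ContextDict()
with ctx.scoped(pillar={'target_server': 'srv1'}):
    print(ctx['pillar'])
```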
So now that we have the ContextDict implementation merged, this should be resolved, I believe. @tnypex Are you willing to test this at the head of develop?
We didn't hear back here, so I'm going to go ahead and close this since it does appear to be fixed. If not, please leave a comment here and we'll happily re-open it. |
Hi, sorry for the delay. I tested from develop but the issue is still present in orchestrate runners started by the reactor.
@jacksontj will you look at this, or should I take care of it?
From my local testing I can't reproduce the same "changed size during iteration" error, but I do run into some issues getting your SLS files to run. The first one seems to be an oddity with pillar.get from jinja. For example, with the following SLS:
If "target_server" is not in the pillar, I'd expect it to come back as None or an empty string, but for some reason it's coming back as an empty dict and I'm getting exceptions like:
Although that looks ugly, it's not actually what's broken. After looking some more I see that the minion is throwing an exception:
Which means it's choking on the pub job. I added some prints and it seems that the pub job looks like:
And 'ret' is definitely not valid (it's supposed to be the returners in the pub job, not the return). After tracking that down I found that there was some variable shadowing in states.saltmod (d968d69). I've opened a PR with fixes (#29397); can you try with that? And if you do still see issues, mind pasting logs/errors/tracebacks as well?
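A generic illustration of this kind of variable shadowing (not the actual states.saltmod code): reusing a name for a loop variable clobbers the structure that is returned later.

```python
def build_pub(returners, minion_return):
    ret = {'ret': returners}           # 'ret' in the pub job should name the returners
    for ret in minion_return.items():  # oops: the loop variable shadows the dict
        pass
    return ret                         # now a (key, value) tuple, not the pub dict


print(build_pub(['smtp'], {'minion1': True}))  # ('minion1', True) instead of a dict
```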
Since we have a fix in and no reply, I am closing this. If it turns out not to be fixed, please comment and we will re-open this. |
Hi, sorry, I should have been more precise about the issue that remains. From the develop branch I still experience mixed-up pillar data in orchestrate runners started by the reactor (no more tracebacks). The reactor sls is:
and the orchestrate sls:
When I run test.ping against multiple minions:
@tnypex I'm thinking that since we've fixed part of this issue we should open a new issue for the remaining piece and link back to this issue, so "what's left" is clearer. Thoughts? |
Due to changes in saltstack#23373, we need to explicitly pass in the grains and pillar we have generated.
Hi all,
With 2015.2.0rc2, I have the following reactor sls that calls an orchestrate runner:
When running test.ping on multiple minions (3 in this case), the reactor sls is rendered fine 3 times:
However, the orchestrate runners will not always get the correct pillar data (they will sometimes get an empty value, sometimes another orchestrate runner's pillar data, sometimes the expected value):
The orchestrate sls:
There are also sometimes RuntimeError: Set changed size during iteration exceptions raised, such as: