
Fix race condition in Salt loader. #39948

Closed
wants to merge 3 commits

Conversation

smarsching
Contributor

What does this PR do?

There was a race condition in the Salt loader when injecting global values (e.g. "__pillar__" or "__salt__") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads, or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.

These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable, and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: the value injected into the modules is now a proxy that internally uses a thread-local variable to decide which object it points to. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.

This means that it will work correctly in the current situation, where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.
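The idea can be sketched as follows. This is a minimal illustration only, not the actual implementation proposed in this PR; the method names and the demonstration code are assumptions made for the sketch:

```python
import threading

class ThreadLocalProxy:
    """Sketch of a proxy whose target object is stored per thread.

    Each thread binds its own target with set_reference(); attribute
    and item access are forwarded to whatever the current thread last
    bound, so threads never observe each other's data.
    """

    def __init__(self, default):
        self._local = threading.local()
        self._default = default

    def set_reference(self, obj):
        self._local.target = obj

    def _target(self):
        return getattr(self._local, 'target', self._default)

    def __getattr__(self, name):
        return getattr(self._target(), name)

    def __getitem__(self, key):
        return self._target()[key]


# Two threads share one proxy (as injected modules would), but each
# binds and reads its own pillar data without interfering.
pillar_proxy = ThreadLocalProxy({})
results = {}

def worker(name):
    pillar_proxy.set_reference({'owner': name})
    results[name] = pillar_proxy['owner']

threads = [threading.Thread(target=worker, args=(n,)) for n in ('a', 'b')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the rebinding done by `set_reference` only touches the calling thread's slot, re-running a loader in one thread leaves every other thread's view untouched.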

What issues does this PR fix or reference?

This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.

Previous Behavior

Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).

New Behavior

Changes to pillar data in one thread will never affect the pillar data in a different thread.

Tests written?

No

Regression Potential

The change to the loader code itself is rather small, but it affects a central component, so if there is a problem, the impact might affect nearly all components. For this reason I submit this PR for the develop branch, even though it addresses a bug that is also present in older releases.

Having said this, I believe that these changes should not cause any problems: for single-threaded processes, everything behaves as before, and for multi-threaded processes, the behavior is now more predictable. I tested this patch on our Salt master (I backported it to 2011.3 for this purpose) and I did not experience any issues there.

@smarsching
Contributor Author

At first the change in the loader mechanism caused a few minor problems, but I think I fixed all of them now. At least, all unit tests pass. Unfortunately I cannot get the integration tests working (in general, not specifically for this branch) because I always get lots of "Address already in use" errors.

@cachedout
Contributor

If you're getting those Address in Use errors, you most likely have a test runner that didn't shut down completely. Just grep for Python procs on your machine and you'll likely see it very quickly. Just kill it and restart the tests and you should be good.

Contributor

@thatch45 thatch45 left a comment

Thanks for making this toward develop, everything looks good, no new threads are being made and the design is well thought out and made to impact a targeted section of the code.

salt/loader.py Outdated
# when a proxy object has already been injected.
# In most programming languages, double-checked locking is considered
# unsafe when used without explicit memory barries because one might read
# an uninitialized valze. In CPython it is safe due to the global
Contributor

value?

Contributor Author

Yes, of course this should be "value". The Z and U keys are next to each other on a German keyboard...
I just committed a fix for this typo.

Contributor

Right, I make plenty of typos and I am always happy when they are found sooner rather than later ;)
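As background for the double-checked locking mentioned in the quoted loader comment, the pattern looks roughly like this. This is a generic sketch, not the loader code from this PR:

```python
import threading

_lock = threading.Lock()
_proxy = None  # shared module-level slot, lazily initialized

def get_proxy():
    # Double-checked locking: the unlocked first check keeps the common
    # (already-initialized) path cheap; the second check under the lock
    # ensures only one thread ever performs the initialization. In
    # CPython this is safe without explicit memory barriers because the
    # GIL serializes the reads and writes of the shared variable.
    global _proxy
    if _proxy is None:           # first check, without the lock
        with _lock:
            if _proxy is None:   # second check, holding the lock
                _proxy = object()   # stand-in for the injected proxy
    return _proxy
```

In languages without a GIL-like guarantee, the first unlocked read can observe a partially constructed object, which is why the pattern is usually discouraged outside of CPython.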

Contributor

@DmitryKuzmenko DmitryKuzmenko left a comment

Clean and clear! Very good!

@smarsching
Contributor Author

smarsching commented Mar 15, 2017 via email

@cachedout
Contributor

Go Go Jenkins!

@cachedout
Contributor

It appears we are seeing quite a number of test failures with this. :-/

@smarsching
Contributor Author

@cachedout That's too bad. Unfortunately I cannot get the integration tests running on my test VM. It's an Ubuntu 14.04 LTS (x86_64) VM and I installed the dependencies with

sudo pip install -r requirements/dev_python27.txt
sudo pip install -r requirements/zeromq.txt

(as described in the docs).

Still, I am always getting these "Address already in use" errors. I can see that there are some lingering processes and I can kill them, but when I start the tests again, the problem simply reappears. Running the tests with elevated privileges (sudo ./tests/runtests.py) does not seem to make a difference.

Is it possible that the base system (Ubuntu 14.04) is a bad choice? Maybe I should give the Docker containers a try, but running docker within a VM seems kind of ugly and there is no Docker image for my host system (macOS 10.12)...
So any hints regarding what I should do differently in my test VM are welcome.

@smarsching
Contributor Author

That's strange: I reverted my VM back to a clean image and reinstalled all the dependencies for Salt and voilà, now most tests run fine for the develop branch. There are still a few failures, but nothing compared to before. I guess I previously messed something up in the test VM which caused the problems with the tests.

Anyway, I will now run the tests for the thread-safe-loader branch and try to find out what is causing the problems.

@smarsching
Contributor Author

As it turns out, the reason for the "Address already in use" errors was not my setup but the version of the develop branch on which my branch was based. That is why the problem suddenly disappeared when I created a fresh setup: I was using a newer version of the develop branch. After locally merging my branch with the updated develop branch, I can also successfully run most tests in my branch.

Having said this, there are still twelve tests that cause problems in my branch, but work in the develop branch:

Error:

  • integration.modules.test_pillar.PillarModuleTest.test_data
  • integration.modules.test_pillar.PillarModuleTest.test_ext_cmd_yaml
  • integration.modules.test_pillar.PillarModuleTest.test_issue_5449_report_actual_file_roots_in_pillar
  • integration.modules.test_pillar.PillarModuleTest.test_issue_5951_actual_file_roots_in_opts
  • integration.modules.test_pillar.PillarModuleTest.test_two_ext_pillar_sources
  • integration.modules.test_pillar.PillarModuleTest.test_two_ext_pillar_sources_override
  • integration.modules.test_mine.MineTest.test_mine_delete
  • integration.modules.test_mine.MineTest.test_mine_flush
  • integration.modules.test_mine.MineTest.test_send
  • integration.modules.test_grains.TestModulesGrains.test_items
  • integration.loader.test_ext_grains.LoaderGrainsTest.test_grains_overwrite

Failure:

  • integration.modules.test_pillar.PillarModuleTest.test_pillar_items

I am going to investigate why my changes make these tests fail.

@smarsching
Contributor Author

@cachedout I pushed a few more commits that should now solve the problems with the failing tests.
These problems were related to the problem with msgpack serialization that I already fixed earlier. However, my original fix only fixed it in a specific spot and not in general. I now improved this by creating the salt.utils.msgpack_compat module and importing it everywhere instead of the msgpack module.

This msgpack_compat module takes care of converting dict-like objects to real dicts and list-like objects to real lists. This should also help when serializing other classes than ThreadLocalProxy.
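The conversion described here can be sketched along these lines; the helper name to_plain and the exact type checks are illustrative assumptions, not the code from the PR:

```python
from collections.abc import Mapping, Sequence

def to_plain(obj):
    # Recursively turn dict-like objects into real dicts and list-like
    # objects into real lists, so that serializers which special-case
    # the exact built-in types accept proxied or subclassed containers.
    if isinstance(obj, Mapping):
        return {to_plain(key): to_plain(value) for key, value in obj.items()}
    # Strings are sequences too, but represent scalar values and must
    # not be converted element by element.
    if isinstance(obj, (str, bytes, bytearray)):
        return obj
    if isinstance(obj, Sequence):
        return [to_plain(item) for item in obj]
    return obj
```

With a helper like this, any Mapping (a defaultdict, an OrderedDict, or a proxy that forwards Mapping methods) comes back as a plain dict before it ever reaches the serializer.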

With these changes, only the following tests result in an error or failure (which also happens when running these tests on the develop branch):

Errors:

  • integration.modules.test_groupadd.GroupModuleTest.test_add
  • integration.modules.test_groupadd.GroupModuleTest.test_adduser
  • integration.modules.test_groupadd.GroupModuleTest.test_chgid
  • integration.modules.test_groupadd.GroupModuleTest.test_delete
  • integration.modules.test_groupadd.GroupModuleTest.test_deluser
  • integration.modules.test_groupadd.GroupModuleTest.test_getent
  • integration.modules.test_groupadd.GroupModuleTest.test_info
  • integration.modules.test_groupadd.GroupModuleTest.test_members
  • unit.states.test_grains.GrainsTestCase.test_list_present_inexistent_nested
  • unit.modules.test_docker.DockerTestCase.test_check_mine_cache_is_refreshed_on_container_change_event

Failures:

  • integration.runners.test_runner_returns.RunnerReturnsTest.test_runner_returns_enabled
  • integration.reactor.test_reactor.ReactorTest.test_ping_reaction (only fails sporadically)
  • unit.templates.test_jinja.TestCustomExtensions.test_serialize_yaml_unicode

The integration.modules.test_groupadd tests seem to result in an error because I have disabled destructive tests.

I suspect that unit.states.test_grains.GrainsTestCase.test_list_present_inexistent_nested might have a problem similar to the one I had with the ThreadSafeProxy (the YAML serializer complains that it cannot serialize a defaultdict object, which might be because a defaultdict is not a plain dict), but I did not investigate further.

Anyway, all these tests also fail when running them on the develop branch, so these problems are most likely not related to my changes.

@cachedout
Contributor

@smarsching We're almost there on this one. Just a few changes to satisfy the automated lint check: https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/9629/violations/file/salt/utils/msgpack_compat.py/

@cachedout
Contributor

@smarsching Did you see the above request for lint fixes? Thanks.

@smarsching
Contributor Author

@cachedout Sorry it took me so long to get back to you. The deadlines of two projects that I am currently working on (stuff I actually get paid for) are approaching quickly, so I have not found much time for my side projects.

For some reason the report you linked seems not to be available any longer, but I ran pylint locally, and it reports two problems:

************* Module salt.utils.msgpack_compat
W: 16, 0: Incompatible Python 3 code found. Proposed fix:
 from __future__ import absolute_import
 import msgpack as _msgpack_orig
 from salt.ext import six as _six
+from salt.ext import six


 # There are certain types which are sequences, but actually represent string (incompatible-py3-code)
R: 23, 0: Too many return statements (14/6) (too-many-return-statements)

The report about incompatible Python 3 code is simply wrong. I think this happens because I import the six module as _six. I do this to avoid having any public symbols in the module that might hide symbols from the original msgpack module. Anyway, pylint seems to be wrong because I tested the code with both Python 2 and 3 and it seems to be fine.

The other message about too many return statements is certainly true, but I do not have a good idea on how to refactor this code so that there are fewer return statements - at least no idea that would result in more readable code in the end.


from __future__ import absolute_import
import msgpack as _msgpack_orig
from salt.ext import six as _six
Contributor

@cachedout cachedout Mar 27, 2017

@s0undt3ch Do you know why the linter is throwing this?

************* Module salt.utils.msgpack_compat
utils/msgpack_compat.py:16: [W1699(incompatible-py3-code), ] Incompatible Python 3 code found. Proposed fix:
 from __future__ import absolute_import
 import msgpack as _msgpack_orig
 from salt.ext import six as _six
+from salt.ext import six


 # There are certain types which are sequences, but actually represent string

Collaborator

Not yet, but this is solvable by doing:

import salt.ext.six as six

This doesn't expose symbols. At least, we haven't had any reports that it exposes anything except the module itself, which is ignorable by the loader.

@smarsching
Contributor Author

I think there is a typo in the last commit:
if gsix.PY3: in line 82 should be if six.PY3.

@cachedout
Contributor

Gah! Thank you. Fixing.

@smarsching
Contributor Author

Thanks, this looks better.
BTW: When I said that using six instead of _six might hide symbols, I meant that if msgpack had a symbol named six (be it a function or a module), that symbol would be hidden when people used msgpack_compat instead of msgpack. As msgpack currently does not have such a symbol and most likely never will, I am fine with this change.

@cachedout
Contributor

Apologies for leaving this so long. @smarsching we need a rebase here to resolve some merge conflicts and after that I think we should be good to get this merged.

@smarsching
Contributor Author

Just so you know: I have not forgotten about this, I simply have not found the time yet. I hope that I will find some time to do the rebase next week.

@cachedout
Contributor

@smarsching Totally understandable. Thanks for the update!

@smarsching
Contributor Author

@cachedout: I rebased my branch against the current develop branch. All tests seem to run fine locally.

@cachedout
Contributor

@smarsching Could you please clean up some of these lint errors? We're getting a lot closer on this one. :]

https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/11299/violations/file/salt/utils/msgpack_compat.py/

@smarsching
Contributor Author

I changed from salt.ext import six as _six to from salt.ext import six. While the original version was also compatible with Python 3, Pylint was unable to determine that fact (it seems like it only looks for the term six instead of analyzing the actual logic).

Regarding the other three warnings (“… should be placed at the top of the module.”), I am not sure how they can be fixed. These lines already are at the top of the module. The only thing preceding them is the documentation string, and I am pretty sure that string actually has to be the top element (at least it is like this in other modules).

@smarsching
Contributor Author

@rallytime Good catch. It seems like the json module has the same problems that the msgpack module has when serializing proxied objects. I am going to use the same workaround that I implemented for msgpack (create a json_compat module that unproxies the objects before passing them to the json module). However, I am probably not going to find the time to do this before next week.

@smarsching
Contributor Author

@rallytime I fixed the problems with the SSH test and I think it runs fine now, but the Jenkins build still does not seem to fully succeed. Unfortunately, I cannot tell what is causing the problems. Jenkins says there is "No identified problem", but it still seems to fail.

@rallytime
Contributor

@smarsching Thanks for the update. We have a race condition that occasionally presents itself. I think you hit that on your last run. We can re-run the tests later, however a new commit will trigger the run as well.

There are some new pylint failures. Can you take a look at those here and fix those up? https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/15511/violations/file/salt/utils/thread_local_proxy.py/

It also looks like we are seeing those mysterious "Import must be at the top of the file" in a new file, but we'll have to address those when @s0undt3ch has a chance to comment.

@smarsching
Contributor Author

@rallytime Thanks for pointing out the lint error. Actually, this one was quite embarrassing because it was so obvious that I should have caught it myself.

Anyway, I fixed it now, so the only two remaining lint errors are the "import should be placed at the top of the file" ones, where we do not know what is causing them.

Contributor

@terminalmage terminalmage left a comment

We're seeing some weird pylint failures here that we can't quite figure out. The one thing I noticed is that the RST in the docstring is invalid, and only in the two files that are failing the lint check. Can you make these two requested changes? I can't see why this would make pylint fail, to be honest, but it is an odd coincidence that the invalid RST appears only in the files that aren't passing pylint.

that actually are no subclasses of ``dict`` or ``list``. This compatibility
layer takes care of converting such objects before passing them to ``json``.

..versionadded:: Oxygen
Contributor

Needs to be a space between the .. and versionadded.

compatibility layer takes care of converting such objects before passing them
to ``msgpack``.

..versionadded:: Oxygen
Contributor

Needs to be a space between the .. and versionadded.

@smarsching
Contributor Author

@terminalmage I implemented the suggested fix, but for some reason Jenkins seems to be stuck now. The jenkins/PR/salt-pr-lint-n job has been in the "Waiting for status to be reported (Required)" state for hours. Therefore, I do not know whether the change actually fixed the lint problem.

@rallytime
Contributor

Looks like the hook was missed to run the test. I've updated your branch to the latest develop - let's see how this test run goes.

@smarsching
Contributor Author

@rallytime Thanks, now jenkins actually ran the tests.
@terminalmage I fixed the docstring as suggested by you, but it does not seem to fix the lint warning: https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/15597/violations/file/salt/utils/json_compat.py/
Don't worry about the unused import warning; I already pushed a fix for that one.

The question is how we should proceed: shall we ignore the warning, suppress it by putting the corresponding directive into the code, or is there some way to fix the lint check?

@rallytime
Contributor

@smarsching Ah, apologies for not making it around here sooner. It looks like there are quite a few files conflicting here. Any chance you can rebase this and fix those merge conflicts? Then we can get @terminalmage to come back around and address your questions.

@mirceaulinic
Contributor

mirceaulinic commented Nov 30, 2017

Thank you @smarsching for all this work. I didn't test your changes (and it turned out to be extremely hard to reproduce in a lab environment), but it looks like your work will finally solve a very old bug where execution modules randomly disappear: #35443. Thank you!

@s0undt3ch
Collaborator

Sorry, I arrived late to the party. The lint reports are now gone.

The salt.utils.json module now takes care of unwrapping objects that are
proxied using the ThreadLocalProxy.

The salt.utils.msgpack module has been added and basically provides the
same functions as the salt.utils.json module, but for msgpack. Like the
json module, it takes care of unwrapping proxies.
@smarsching
Contributor Author

@rallytime
I have rebased the branch instead of merging it because the commit history was already quite messy. As far as I can tell, all lint problems are gone, so you should now be able to merge the branch.

@rallytime
Contributor

Hi @smarsching my apologies for the delay here. Thank you for the update - we definitely want to get this in. Currently there are some merge conflicts here and after discussing this with @cachedout, we'd also like to get this into the oxygen release cycle (current develop will be Fluorine). This will be a dot-release improvement in the oxygen release cycle, rather than the first (.0) release of Oxygen.

However, that means we need this PR submitted to the oxygen branch instead of develop. Would you be willing to resubmit there? Then we can do another round of reviews since you've needed to change some things since this was originally submitted, make sure this will work OK with our unicode updates, get a better test run, etc.

Please let me know if this is something you're able to do or if you have any questions or concerns. Thanks! :)

@smarsching
Contributor Author

@rallytime I rebased my changes against the oxygen branch and created the new PR #45782.

@rallytime
Contributor

@smarsching Thank you so much! I'll close this one out so we can review this all in one place.
