-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in Salt loader. #39948
Conversation
At first the change in the loader mechanism caused a few minor problems, but I think I fixed all of them now. At least, all unit tests pass. Unfortunately I cannot get the integration tests working (in general, not specifically for this branch) because I always get lots of "Address already in use" errors. |
If you're getting those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this toward develop, everything looks good, no new threads are being made and the design is well thought out and made to impact a targeted section of the code.
salt/loader.py
Outdated
# when a proxy object has already been injected. | ||
# In most programming languages, double-checked locking is considered | ||
# unsafe when used without explicit memory barries because one might read | ||
# an uninitialized valze. In CPython it is safe due to the global |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, of course this should be "value". The Z and and the U keys are next to each other on a German keyboard...
I just commited a fix for this typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I make plenty of typos and I am always happy when they are found sooner then later ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean and clear! Very good!
Definitely. I am always horrified when looking at my own code after some time. :-)
|
Go Go Jenkins! |
It appears we are seeing quite a number of test failures with this. :-/ |
@cachedout That's too bad. Unfortunately I cannot get the integration tests running on my test VM. It's a Ubuntu 14.04 LTS (x86_64) VM and I installed the dependencies with
(as described in the docs). Still, I am always getting these "Address already in use" errors. I can see that there are some lingering processes and I can kill them, but when I start the tests again, the problem simply appears again. Running the tests with elevated privileged ( Is it possible that the base system (Ubuntu 14.04) is a bad choice? Maybe I should give the Docker containers a try, but running docker within a VM seems kind of ugly and there is no Docker image for my host system (macOS 10.12)... |
That's strange: I reverted my VM back to a clean image and reinstalled all the dependencies for Salt and voilà, now most tests run fine for the develop branch. There are still a few failures, but nothing compared to before. I guess I previously messed something up in the test VM which caused the problems with the tests. Anyway, I will now run the tests for the thread-safe-loader branch and try to find out what is causing the problems. |
As it turns out the reason for the "Address already in use" errors was not with my setup but with the version of the develop branch on which my branch is based. That is why the problem suddenly disappeared when creating a fresh setup: I was using a newer version of the develop branch. After locally merging my branch with the updated develop branch, I can also successfully run most tests in my branch. Having said this, there are still twelve tests that cause problems in my branch, but work in the develop branch: Error:
Failure:
I am going to investigate why my changes make these tests fail. |
@cachedout I pushed a few more commits that should now solve the problems with the failing tests. This With these changes, only the following tests result in an error or failure (which also happens when running these tests on the develop branch): Errors:
Failures:
The I suspect that the Anyway, all these tests also fail when running them on the develop branch, so these problems are most likely not related to my changes. |
@smarsching We're almost there on this one. Just a few changes to satisfy the automated lint check: https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/9629/violations/file/salt/utils/msgpack_compat.py/ |
@smarsching Did you see the above request for lint fixes? Thanks. |
@cachedout Sorry, it took me so long to get back to you. The deadlines of two projects that I am currently working on (stuff I actually get paid for) are moving in quickly, so I do not find a lot of time to take care of my side projects. For some reason the report you linked seems not to be available any longer, but I ran pylint locally, and it reports two problems:
The report about incompatible Python 3 code is simply wrong. I think this happens because I import the The other message about too many return statements is certainly true, but I do not have a good idea on how to refactor this code so that there are less return statements - at least no idea that would result in more readable code in the end. |
salt/utils/msgpack_compat.py
Outdated
|
||
from __future__ import absolute_import | ||
import msgpack as _msgpack_orig | ||
from salt.ext import six as _six |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@s0undt3ch Do you know why the linter is throwing this?
************* Module salt.utils.msgpack_compat
utils/msgpack_compat.py:16: [W1699(incompatible-py3-code), ] Incompatible Python 3 code found. Proposed fix:
from __future__ import absolute_import
import msgpack as _msgpack_orig
from salt.ext import six as _six
+from salt.ext import six
# There are certain types which are sequences, but actually represent string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but this is solvable by doing:
import salt.ext.six as six
This doesn't expose symbols. At least, we haven't any reports that it exposes anything except the module which is ignorable by loader.
I think there is a typo in the last commit: |
Gah! Thank you. Fixing. |
Thanks, this looks better. |
Apologies for leaving this so long. @smarsching we need a rebase here to resolve some merge conflicts and after that I think we should be good to get this merged. |
Just so you know: I have not forgotten about this, I simply have not found the time yet. I hope that I will find some time to do the rebase next week. |
@smarsching Totally understandable. Thanks for the update! |
8928c2b
to
8a5dc34
Compare
@cachedout: I rebased my branch against the current develop branch. All tests seem to run fine locally. |
@smarsching Could you please clean up some of these lint errors? We're getting a lot closer on this one. :] |
I changed Regarding the other three warnings (“… should be placed at the top of the module.”), I am not sure how they can be fixed. These lines already are at the top of the module. The only thing preceding them is the documentation string, and I am pretty sure that string actually has to be the top element (at least it is like this in other modules). |
@rallytime Good catch. It seems like the |
@rallytime I fixed the problems with the SSH test and I think it runs fine now, but the Jenkins build still does not seem to fully succeed. Unfortunately, I cannot tell what is causing the problems. Jenkins says there is "No identified problem", but it still seems to fail. |
@smarsching Thanks for the update. We have a race condition that occasionally presents itself. I think you hit that on your last run. We can re-run the tests later, however a new commit will trigger the run as well. There are some new pylint failures. Can you take a look at those here and fix those up? https://jenkins.saltstack.com/job/PR/job/salt-pr-lint-n/15511/violations/file/salt/utils/thread_local_proxy.py/ It also looks like we are seeing those mysterious "Import must be at the top of the file" in a new file, but we'll have to address those when @s0undt3ch has a chance to comment. |
@rallytime Thanks for pointing out the lint error. Actually, this one was quite embarrassing because it was so obvious that I should have caught it myself. Anyway, I fixed it now, so that the only two remaining lint errors are the "import should be placed on top of the file" ones, where we do not know what is causing them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're seeing some weird pylint failures here, that we can't quite figure out. The one thing I noticed is that the RST in the docstring is invalid, and only in both files that are failing the lint check. Can you make these two requested changes? I can't see why this would make pylint fail, to be honest, but it is an odd coincidence that the invalid RST appears only in the files that aren't passing pylint.
salt/utils/json_compat.py
Outdated
that actually are no subclasses of ``dict`` or ``list``. This compatibility | ||
layer takes care of converting such objects before passing them to ``json``. | ||
|
||
..versionadded:: Oxygen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be a space between the ..
and versionadded
.
salt/utils/msgpack_compat.py
Outdated
compatibility layer takes care of converting such objects before passing them | ||
to ``msgpack``. | ||
|
||
..versionadded:: Oxygen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be a space between the ..
and versionadded
.
@terminalmage I implemented the suggested fix, but for some reason Jenkins seems to be stuck now. The jenkins/PR/salt-pr-lint-n job has been in the "Waiting for status to be reported |
Looks like the hook was missed to run the test. I've updated your branch to the latest develop - let's see how this test run goes. |
@rallytime Thanks, now jenkins actually ran the tests. The question is: How should we proceed: Shall we ignore the warning? Shall we suppress it by putting the corresponding directive into the code? Or do you think there is some way to fix the lint check? |
@smarsching Ah, apologies for not making it around here sooner. It looks like there are quite a few files conflicting here. Any chance you can rebase this and fix those merge conflicts? Then we can get @terminalmage to come back around and address your questions. |
Thank you @smarsching for all this work. I didn't test your changes (and it turned to be extremely hard to reproduce in a lab environment), but it looks like your work will finally solve a very old bug when execution modules randomly disappear: #35443. Thank you! |
Sorry, I arrived late to the party. The lint reports are now gone. |
081d981
to
ad3a7be
Compare
There was a race condition in the salt loader when injecting global values (e.g. "__pillar__" or "__salt__") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously. There have been earlier attempts to fix this problem (saltstack#27937, saltstack#29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules. These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance only. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on saltstack#39670. This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: The value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change. This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessary know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.
The salt.utils.json module now takes care of unwrapping objects that are proxied using the ThreadLocalProxy. The salt.utils.msgpack module has been added and basically provides the same functions as the salt.utils.json module, but for msgpack. Like the json module, it takes care of unwrapping proxies.
ad3a7be
to
97ce1e5
Compare
@rallytime |
Hi @smarsching my apologies for the delay here. Thank you for the update - we definitely want to get this in. Currently there are some merge conflicts here and after discussing this with @cachedout, we'd also like to get this into the oxygen release cycle (current develop will be However, that means we need this PR submitted to the oxygen branch instead of develop. Would you be willing to resubmit there? Then we can do another round of reviews since you've needed to change some things since this was originally submitted, make sure this will work OK with our unicode updates, get a better test run, etc. Please let me know if this is something you're able to do or if you have any questions or concerns. Thanks! :) |
@rallytime I rebased my changes against the oxygen branch and created the new PR #45782. |
@smarsching Thank you so much! I'll close this one out so we can review this all in one place. |
What does this PR do?
There was a race condition in the salt loader when injecting global values (e.g. "pillar" or "salt") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously.
There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.
These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance only. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.
This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: The value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module
again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.
This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessary know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach
used by the two earlier patches.
What issues does this PR fix or reference?
This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.
Previous Behavior
Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).
New Behavior
Changes to pillar data in one thread will never affect the pillar data in a different thread.
Tests written?
No
Regression Potential
The change to the loader code itself is rather small, but it affects a central component. So if there is a problem, the impact might affect nearly all components. For this reason I dubmit this PR for the develop branch, even though it addresses a bug that is actually also present in older releases.
Having said this, I believe that these changes should not cause any problems because to single-threaded processes, everything behaves as before, and for multi-threaded processes, the behavior is now more predictable. I tested this patch on our Salt master (I backported it to 2011.3 for this purpose) and I did not experience any issues there.