wonky loader syndrome #23839
I threw the external pillar exception in there just because... well, I just noticed that it was having the problem, and it has been running without failure for several weeks -- thought I may as well include it in case it provides some additional insights.
As a follow-up: after observing that the loader did not seem to consistently keep track of external modules of various types, I had the idea of linking them into the installed salt tree to see if the problem cleared up.
Putting them in place to be picked up as internal modules did help somewhat: instead of only the first four orchestration runs succeeding, all but the final four succeeded.
I've been staring at the loader code and the Depends decorator. I understand the goal, but without properly synchronized updates to the dependency_dict it causes the above error. The above commit doesn't attempt to fix the actual problem, but it does make it so that modules.(archive|oracle) and test.missing_func() can't hamstring the loader.
The Depends.enforce_dependency() class method fires unsuccessfully. There appears to be no synchronization within the Depends decorator class with respect to the class-global dependency_dict, which results in incomplete population of any loader instantiation occurring at the time of one of these exceptions. This would mitigate the immediate effects of saltstack#23839 and saltstack#23373.
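For illustration, here is a minimal sketch (not Salt's actual code) of the kind of synchronization being described: guarding the class-level dependency_dict with a lock so that one loader instance registering dependencies cannot race another that is checking them. The names mirror the decorator discussed above; the lock and the exact structure are assumptions.

```python
import threading
from collections import defaultdict


class Depends(object):
    # Class-level registry shared by every loader instantiation; this is the
    # structure that can be corrupted when two threads touch it at once.
    dependency_dict = defaultdict(set)
    _lock = threading.RLock()  # assumption: one lock guarding all access

    def __init__(self, *dependencies):
        self.dependencies = dependencies

    def __call__(self, function):
        # Register the decorated function under each named dependency,
        # holding the lock so concurrent loaders cannot interleave updates.
        with Depends._lock:
            for dep in self.dependencies:
                Depends.dependency_dict[dep].add(function)
        return function

    @classmethod
    def enforce_dependency(cls, available_functions):
        # Snapshot under the lock so registrations happening in another
        # thread cannot change the set's size while we iterate over it.
        with cls._lock:
            snapshot = {dep: frozenset(funcs)
                        for dep, funcs in cls.dependency_dict.items()}
        missing = {}
        for dep, funcs in snapshot.items():
            if dep not in available_functions:
                missing[dep] = funcs
        return missing
```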
Interesting failure... I was looking through the code and that set definitely isn't mutated during the loop, so I'm very interested in how it's changing size during the iterations, especially since the code specifically makes a copy of the set before iterating over it. Do any of your modules spawn background threads or anything? If not, is it possible to make a simpler reproduction case?
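To illustrate the failure mode being discussed (a toy reproduction, not Salt code): iterating a set that another thread is still populating can raise `RuntimeError: Set changed size during iteration`, and copying the set before iterating is the usual defense.

```python
import threading

shared = set()

def writer():
    for i in range(200000):
        shared.add(i)

t = threading.Thread(target=writer)
t.start()

# Iterating the live set while the writer thread is adding to it can raise
# "RuntimeError: Set changed size during iteration".
try:
    for _ in shared:
        pass
except RuntimeError as exc:
    print(exc)

# Taking a snapshot first avoids that particular error.
for _ in set(shared):
    pass

t.join()
```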
No thread/proc spawning other than what happens normally during reactor... An additional condition I observed was that the loader's failure threshold...
They are all pre-baked, but it's possible that they are all acting on the same memory space (causing your race). Can you try setting the number of reactor threads (...)?
I will give that a shot over the next couple of days and get back to you with the results.
Just wanted to let you know that this is still on my radar--I'll hopefully have some time to get to this within the week.
I am wondering if this got addressed with the introduction of ContextDict.
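For context, the idea behind that change (a rough analogy only, not Salt's actual ContextDict implementation) is to give each thread of execution its own view of otherwise-shared loader state, so concurrent reactor/orchestration runs stop clobbering each other. A minimal sketch using plain thread-local storage:

```python
import threading

class PerThreadView(object):
    """Toy stand-in for the concept: shared defaults that each thread can
    override without affecting other threads."""

    def __init__(self, **defaults):
        self._defaults = defaults
        self._local = threading.local()

    def set(self, key, value):
        # Override only for the calling thread.
        setattr(self._local, key, value)

    def get(self, key):
        return getattr(self._local, key, self._defaults.get(key))

loader_state = PerThreadView(grains={}, pillar={})

def reactor_worker(name):
    loader_state.set('pillar', {'worker': name})
    # Each worker sees only its own pillar override.
    print(name, loader_state.get('pillar'))

threads = [threading.Thread(target=reactor_worker, args=('w%d' % i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```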
High level:
- 12-15 cloud instance minions
- custom execution module: tls2 (virtual: tls_ca)
- custom runner module: tls_ca_run (virtual: tls_ca)
- salt-cloud -dy [list of 6+ minions]
The first 4-5 are destroyed, the event is caught, the reactor is triggered, the orchestration states are rendered and processed successfully.
Then... this.
versions
In the ext pillar calls, `_s` is just an alias for `__salt__`.
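In other words, something along these lines (an illustrative fragment, not the user's actual pillar code; the function name it would call is hypothetical):

```python
# Hypothetical external pillar module. __salt__ is injected by Salt's
# loader at runtime, so this only makes sense inside a pillar module.

def ext_pillar(minion_id, pillar, *args, **kwargs):
    _s = __salt__  # "_s" is just shorthand for __salt__
    # e.g. call into the custom tls_ca execution module (hypothetical name):
    # ca_data = _s['tls_ca.some_function'](minion_id)
    return {}
```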
tls_ca is a custom execution module
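The module itself isn't reproduced above; the detail relevant to the loader discussion is the virtual-name aliasing mentioned earlier (a file named tls2 loading under the name tls_ca). A minimal, hypothetical skeleton of that arrangement:

```python
# tls2.py -- skeleton showing how a module file named tls2 can load under
# the virtual name tls_ca; the real module's contents are not shown in the
# issue, so everything below is illustrative.
__virtualname__ = 'tls_ca'


def __virtual__():
    # The real module would presumably gate on its dependencies here
    # (returning False when they are missing).
    return __virtualname__


def example(minion_id):
    '''
    Hypothetical function; the actual API of the custom module is unknown.
    '''
    return {'minion': minion_id}
```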
The reactor
The orchestration state
The runner
Both the runner and the external pillar function perfectly except when I give salt-cloud more than a handful of instances to destroy at a time. 2015.5.0 performs identically on this sort of run. 2014.7.5 also exhibits loader failures in the same context--the only difference is that it merely complains that the external module functions do not exist rather than raising the set-size-change error.
I am at a loss. If it would help, I'll gzip a debug log from a failed run and toss it up somewhere.
I was looking at the possibility of just queueing the minion_ids of the destroyed instances, but it looks like the only interface for automatically processing them (the queue runner) just wants to dump events back on the bus--I'm thinking it would experience the same difficulty.
I thought this was such a simple state that it would be easily taken care of from the reactor; however, when I saw the same sorts of problems there, I figured it needed to be passed off to orchestration.
I started developing this process on 2014.7.4, then moved to 2015.5.0 when its release was announced. The above log fragments are from 2015.5 (from about 3 hours ago). I'll probably go back to 2014.7.5. Either way, the result is the same.
It is an important part of the process I've been building that these certificates be revoked and removed very quickly after instance destruction. They contain multiple IP addresses in their subjAltName field--a minion with the same name but an old certificate would just cause all sorts of confusion.
The next step is (zomg) cron-triggered shell scripts. Kinda makes me want to cry when I think about it. Any assistance that can be provided will be quite welcome.